1.0  INTRODUCTION


This  document  is  the  Final  Scientific  and  Technical  Report  (Contract  Data  Requirements  List 
[CDRL]  A004)  for  contract  No.  DAAH01-99-C-R129  entitled,  “Real-Time  Network 
Management.” This is the final technical report for a 31-month effort (30 months of research and 1
month  for  the  final  technical  report),  which  ran  from  14  April  1999  to  12  November  2001.  BAE 
SYSTEMS  Portal  Solutions  Inc.  (PSI),  formerly  Synectics  Corporation,  was  the  prime  contractor 
with  the  State  University  of  New  York  Institute  of  Technology  (SUNY  IT)  as  a  subcontractor  on 
the  effort. 

In  response  to  an  invitation  from  Hillarie  Ormon,  the  Phase  I  Defense  Advanced  Research 
Projects  Agency  (DARPA)  sponsor  of  the  Real-Time  Network  Management  Small  Business 
Innovative  Research  (SBIR)  effort,  Synectics  Corporation  submitted  a  proposal  for  Phase  II. 
The Phase II SBIR contract was awarded on 14 April 1999 with the objective of developing
methods  for  diagnosing  network  performance  problems  in  real  time  and  expanding  upon  the 
work  accomplished  in  Phase  I. 


2.0  OBJECTIVES 


The  objective  of  this  effort  was  to  develop  methods  for  diagnosing  network  performance 
problems  in  real  time.  According  to  our  Phase  I  research,  it  is  possible  to  collect  data  on  the 
network  and  morph  it  into  queuing  models  to  produce  information  about  the  network  and 
physical layers of nodes on a network. The Phase II research expanded upon the work
accomplished  in  Phase  I. 

The  program  consisted  of  four  tasks.  The  first  task  entailed  the  research  and  categorization  of 
models  that  provided  both  a  sufficiently  accurate  description  of  the  behavior  of  the  computer 
network  and  were  computationally  tractable  for  incorporation  into  a  real-time  network 
monitoring  system.  Secondly,  PSI  researched  mechanisms  for  achieving  the  real-time 
determination  of  the  parameters  of  the  current-to-be  models  of  a  network,  for  monitoring  k-step 
ahead  forecasting  accuracy  of  the  current  models,  and  for  replacing  the  latter  in  case  of 
breakdown.  Thirdly,  PSI  built  an  object  model  of  the  analytical  and  operational  entities  that 
make  up  the  architecture.  Finally,  PSI  developed  the  application  based  on  the  research  and 
design  of  the  previous  tasks  using  Java,  with  support  from  JMX  (Java  Management  Extensions), 
JDBC (Java Database Connectivity), and the object model designed in the third task.


2.1  TASK 1 - MATHEMATICAL MODELING


2.1.1  Network Traffic Model Research

The  range  of  computer  network  traffic  models  spans  a  rather  wide  spectrum,  from  the  simple 
Markovian  families  at  one  end,  to  the  newer  self-similar  long-range  dependent  varieties  at  the 
other  end.  It  now  appears  that  the  accuracy  of  a  model  is  fundamentally  tied  to  the  time  scale  on 
which  the  model  is  exercised.  It  has  been  demonstrated  in  the  research  literature  that  Markovian 
models  are  not  accurate  enough  for  large  time  scales,  and  that  they  therefore  underestimate  the 
resources  required  by  the  network  over  such  time  scales.  However,  the  lower  bound  of  a  large 
time  scale  is  not  exactly  clear;  it  is  suspected  to  be  in  the  minutes.  On  the  other  hand,  models 
with  long-range  dependence  should  definitely  not  be  applied  to  shorter  time  scales.  Their 
computational  requirements  will  exclude  them  anyway  for  real-time  monitoring  purposes. 

A computationally efficient and theoretically tractable compromise between the two extremes is
the family of locally stationary models, by which we mean concatenations of stationary but not
necessarily Markovian models. In this approach, a fundamental problem is the identification of
model change points. Indeed, it was expected (and confirmed in much of the computation we
carried out) that traffic may be subdivided into short epochs, each just a few minutes long,
that may be assumed stationary. The problem of identifying the beginning of a new
epoch is addressed in Sections 2.1.2 and 2.1.3. We modeled each epoch as a multivariate
(vector) autoregressive (AR) model, and recomputed the parameters of such a model for each
epoch of the traffic. A stationary traffic AR model, where the traffic is specified as an
m-dimensional vector time series $\{X_t\}$ (the m components correspond, for example, to m SNMP
variables of interest), is given by the equation

$$X_t = W + \sum_{i=1}^{p} A_i X_{t-i} + \varepsilon_t \qquad (1)$$

where p is the model order, the m-by-m matrices $A_i$ are the p coefficient matrices of the model,
W is the m-by-1 intercept vector, and $\varepsilon_t$ is an m-by-1 random noise vector. We assumed that
$\varepsilon_t$ has zero mean with an m-by-m covariance matrix C. $X_t$ itself is, of course, an m-by-1
vector. The set

$$\{p, A_i, W, C\} \qquad (2)$$

will be referred to as the parameters of the model.

The  modeling  steps  of  both  off-line  analysis  and  real-time  monitoring  consisted  of  the  following 
tasks. 


□  Identifying  the  next  model  change  point 

□  For  the  next  epoch 

♦  selecting an optimal model order p (in this work, we used a kind of Schwarz
Bayesian Criterion statistic)

♦  estimating the parameters $\{p, A_i, W, C\}$ of the model, and their confidence
intervals (we used mostly least-squares techniques)

♦  performing  a  diagnostic  checking  (validation  against  observed  data)  of  the  fitted 
model  (we  used  statistics  like  the  Li-McLeod  portmanteau  criterion) 

♦  performing  spectral  analysis  of  the  fitted  model 
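
The estimation step can be illustrated on the simplest possible case. The following is a minimal sketch, assuming a one-dimensional series and a fixed order p = 1; the function name fit_ar1 is hypothetical and is not part of the system described in this report, and a full implementation would also cover the vector case, the order selection, and the diagnostic checking listed above.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = w + a * x[t-1] + noise (univariate, order 1).

    Returns (a, w, c): coefficient, intercept, and estimated noise variance.
    """
    x_prev = series[:-1]
    x_next = series[1:]
    n = len(x_prev)
    if n < 2:
        raise ValueError("need at least three observations")
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    sxx = sum((u - mean_prev) ** 2 for u in x_prev)
    sxy = sum((u - mean_prev) * (v - mean_next) for u, v in zip(x_prev, x_next))
    if sxx == 0:
        raise ValueError("series is constant")
    a = sxy / sxx                      # slope of the regression of x[t] on x[t-1]
    w = mean_next - a * mean_prev      # intercept
    residuals = [v - (w + a * u) for u, v in zip(x_prev, x_next)]
    # residual variance, allowing for the two fitted parameters
    c = sum(r * r for r in residuals) / max(n - 2, 1)
    return a, w, c
```

On a noise-free series generated by x[t] = 2 + 0.5 x[t-1], the fit recovers a = 0.5 and w = 2 with a residual variance of essentially zero.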

The  end  goal  of  the  modeling  process  was  a  characterization  of  the  underlying  physical  processes 
that  give  rise  to  the  observed  and  modeled  traffic.  Such  characterization  could  then  be  applied  to 
more  practical  concerns,  for  example  to: 

□  defining  a  concept  of  normal  behavior,  and  thus  characterizing  (within  some 
confidence  intervals)  that  an  observed  behavior  is  anomalous 

□  assessing  queuing  parameters  for  further  analyses  and  network  resource  requirements 

□  predicting the next vector value $X_{t+1}$ using equation (1) above, as long as no new
epoch has been signaled. For example, the expected value for $X_t$ is given by the
following equation, where I is the m-by-m identity matrix:

$$E\{X_t\} = (I - A_1 - A_2 - \cdots - A_p)^{-1} W \qquad (3)$$
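
As an illustration of the prediction item, the sketch below evaluates equation (3) and a one-step forecast for a two-dimensional model of order 1. It is a sketch under stated assumptions: the helper names (solve_2x2, stationary_mean, predict_next) are hypothetical, the model is restricted to m = 2 and p = 1 so that no linear-algebra library is needed, and any numbers passed in are illustrative rather than measured.

```python
def solve_2x2(m, b):
    """Solve m x = b for a 2-by-2 matrix m using Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("singular matrix")
    return [(b[0] * m[1][1] - m[0][1] * b[1]) / det,
            (m[0][0] * b[1] - b[0] * m[1][0]) / det]


def stationary_mean(a1, w):
    """E{X_t} = (I - A_1)^{-1} W for a two-dimensional model of order 1."""
    i_minus_a = [[1 - a1[0][0], -a1[0][1]],
                 [-a1[1][0], 1 - a1[1][1]]]
    return solve_2x2(i_minus_a, w)


def predict_next(a1, w, x_t):
    """One-step forecast W + A_1 X_t (the noise term has zero mean)."""
    return [w[i] + sum(a1[i][j] * x_t[j] for j in range(2)) for i in range(2)]
```

A useful sanity check is that the stationary mean is a fixed point of the forecast: predict_next(a1, w, stationary_mean(a1, w)) returns the mean itself.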

Consider,  for  example,  the  following  two-dimensional  data  collected  at  a  router  interface  over  a 
period of 15 minutes. In the graph, black represents ifInOctets and gray represents ifOutOctets.


Exhibit 1.  Router Interface Series

The Cumulative Sum (CUSUM) model change point technique of Sections 2.1.2 and 2.1.3
identifies $t \approx 115$ and $t \approx 165$ as change points. Here, we argued that what was observed at
time t is the state of the interface, and no shift of the ifOutOctets data was attempted to match
against ifInOctets. The modeling of epochs

$$E_1 = \{\, t \mid 1 \le t \le 115 \,\} \quad \text{and} \quad E_2 = \{\, t \mid 116 \le t \le 165 \,\}$$

produces the following parameters.


Table 1.  AR Models for Epochs E1 and E2 of Exhibit 1

Epoch   p (optimal     A1 (coefficient matrix)   W (intercept vector)   C (noise covariance matrix)
        model order)

E1      1              [  0.7310  -0.0524 ]      10^? * [ 3.3658 ]      10^12 * [  9.8485  -0.4018 ]
                       [ -0.0701   0.7142 ]             [ 2.8888 ]              [ -0.4018   4.5499 ]

E2      1              [  0.8392  -0.0988 ]      10^? * [ 3.5977 ]      10^?  * [  1.8529  -0.0793 ]
                       [ -0.0357   0.4788 ]             [ 3.6242 ]              [ -0.0793   0.5571 ]

(10^? denotes a power-of-ten scale factor whose exponent is illegible in the source.)

Note that modeling an m-dimensional vector time series is not equivalent to modeling m
one-dimensional time series separately. For example, modeling only the ifInOctets series of
Exhibit 1, we found for the two epochs

$$E_1^{in} = \{\, t \mid 1 \le t \le 115 \,\} \quad \text{and} \quad E_2^{in} = \{\, t \mid 116 \le t \le 165 \,\}$$

the following parameters.


Table 2.  AR Models for Epochs E1^in and E2^in of Exhibit 1

Epoch    p (optimal     Ai (coefficient    W (intercept vector)   C (noise covariance matrix)
         model order)   matrices)

E1^in    1              A1 = [0.7430]      10^? * [2.7861]        10^12 * [9.5992]

E2^in    3              A1 = [0.4356]      -10^? * [1.1331]       10^13 * [1.1677]
                        A2 = [0.1121]
                        A3 = [0.5600]

(10^? denotes a power-of-ten scale factor whose exponent is illegible in the source.)

For $E_1^{in}$ the Li-McLeod portmanteau statistic evaluates to 0.0501, hence the fitted model passed
the diagnostic checking defined by this statistic (the conventional threshold for passing is 0.05).
Likewise, $E_2^{in}$ passed with a value of 0.05895.

The $A_i$ and C parameters may be used to compute the spectral decomposition of the AR model.
In the case of order p = 1, the computation was straightforward. For example, models E1 and E2
above have the following spectral decomposition, with corresponding confidence intervals.

Table 3.  Spectral Decomposition of Models E1 and E2

Model   Spectral basis   Confidence intervals
                         for spectral basis

E1      [ 0.6015 ]       [ 0.5995 ]
        [ 0.7991 ]       [ 0.4511 ]

E2      [ 0.2581 ]       (illegible in source)
        [ 0.9661 ]

In the case of order p > 1, the model was reinterpreted as an artificial model of order 1, with
an augmented p-by-p block coefficient matrix $A_1'$ whose first block row consisted of the
coefficient matrices $A_1, \ldots, A_p$ of the original model. The p-1 subsequent block rows
consisted of the (p-1)-by-(p-1) block identity matrix (each diagonal block being the ordinary
m-by-m identity matrix) followed by a block column of zeros. Likewise, the state
and noise vectors used with this $A_1'$ were simply mp-by-1 vectors $X_t'$ and $\varepsilon_t'$, where the
block rows of $X_t'$ were $X_t, X_{t-1}, \ldots, X_{t-p+1}$ in that order, and those of $\varepsilon_t'$ were
$\varepsilon_t, 0, 0, \ldots, 0$ in that order (each 0 is the m-by-1 zero vector). From this
reinterpretation, we could, for example, compute a spectral
decomposition for the order-3 model $E_2^{in}$ above.
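
The order-1 reinterpretation described above can be sketched as follows. This is an illustrative construction only: the function name companion_matrix is hypothetical, matrices are represented as plain nested lists, and the eigenvalue computation that would follow is left out.

```python
def companion_matrix(coeffs):
    """Build the augmented order-1 matrix from AR coefficient matrices A_1..A_p.

    coeffs is a list of p matrices, each m-by-m (lists of lists).
    Returns an (m*p)-by-(m*p) matrix as a list of lists.
    """
    p = len(coeffs)
    m = len(coeffs[0])
    size = m * p
    big = [[0.0] * size for _ in range(size)]
    # first block row: A_1, A_2, ..., A_p placed side by side
    for k, a in enumerate(coeffs):
        for i in range(m):
            for j in range(m):
                big[i][k * m + j] = a[i][j]
    # remaining block rows: identity blocks, shifted one block to the left
    for r in range(m, size):
        big[r][r - m] = 1.0
    return big
```

For the order-3, one-dimensional model of Table 2, companion_matrix([[[0.4356]], [[0.1121]], [[0.5600]]]) yields the 3-by-3 matrix with first row (0.4356, 0.1121, 0.5600) and rows (1, 0, 0) and (0, 1, 0) beneath it.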

Table 4.  Spectral Decomposition of Model E2^in

Spectral basis           Corresponding
                         confidence intervals

[ 0.5489]                0.0257
[-0.7378 - 0.0760i]      0.0563
[-0.7378 + 0.0760i]      0.0563

The  real  challenge  lay  in  assigning  dynamical  semantics  to  the  spectral  information.  For 
example,  it  might  be  feasible  to  classify  the  eigenvalues  according  to  their  network  traffic 
importance. 


2.1.2  Model Change Point Research

The detection of model change points was paramount to our appropriate visualization of the
network as a coherent system. Suppose at time t, a model change point is observed (for example,
$M_\alpha(R \mid t) \ne M_\beta(R \mid t + \delta t)$ is true). Even though we might not know the model
$M_\beta(R \mid t + \delta t \in e)$, we could assume that until the next epoch e was indicated every
sampling from then on should cohere with the new model that had to be computed from the
sampling data. In other words:


Trigger for a new epoch e
        |
        v
Start a new data sampling process  [B]
        |
        v
Obtain traffic distribution from sampling distribution  [C]
        |
        v
Obtain aggregate queuing models for nodes and switches
        |
        +--> Update nodes and links queuing model properties
        |
        +--> Obtain local and global predictions using above models

In  this  section  we  outline  our  process  for  the  box  marked  C  above.  The  algorithm  is  described 
below.  On  an  epoch  e,  we  computed  the  relevant  distribution  parameters  for  the  random  variable 
X  (based  on  the  random  samples  collected  within  the  epoch)  and  inferred  the  attendant 
distribution  as  shown. 

□  Obtain the sample mean $\bar{x} = \sum_{i=1}^{n} x_i / n$ on a sample of n points.

□  Obtain the sample variance $s_x^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n-1)$.

□  Obtain the sample standard deviation $s_x = \sqrt{s_x^2}$.

□  Switch on the squared coefficient of variation $c^2 = s_x^2 / \bar{x}^2$:

♦  Case (1): exponential distribution with mean $\bar{x} = 1/\lambda$ and variance $s_x^2 = 1/\lambda^2$.

♦  Case (1/k): Erlang-k distribution, k > 1 an integer.

♦  Case (>1): a k-stage hyperexponential distribution (see case B for construction
of this).

♦  Default: a gamma distribution with parameters $\alpha$ and $\lambda$.

The distributions are given next. For the random variable X, the pdf is given by the
following cases of interest.

Exponential:  $f(x) = \lambda e^{-\lambda x}$,  x > 0, $\lambda$ > 0
              = 0, otherwise
              $\bar{x} = 1/\lambda$,  $s^2 = 1/\lambda^2$


Erlang-k:  $f(x) = \dfrac{(k\lambda)^k x^{k-1} e^{-k\lambda x}}{(k-1)!}$,  x > 0, $\lambda$ > 0, k an integer
           = 0, otherwise
           $\bar{x} = 1/\lambda$,  $s^2 = 1/(k\lambda^2)$


Suppose that, for our distribution, $1/\lambda = \bar{x}$. Let k be the largest integer less than or
equal to $\bar{x}^2 / s_x^2$, i.e., the floor of this ratio. Then this distribution is the Erlang-k
random variable with parameters k and $\lambda$.

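k-stage hyperexponential:  $f(x) = \sum_{i=1}^{k} \alpha_i \mu_i e^{-\mu_i x}$,  x > 0, $\mu_i$ > 0, k an integer
                           = 0, otherwise

where the $\alpha_i$ are the stage probabilities, which sum to 1. We compute

$$E[x] = \bar{x} = \sum_{i=1}^{k} \frac{\alpha_i}{\mu_i}, \qquad E[x^2] = 2 \sum_{i=1}^{k} \frac{\alpha_i}{\mu_i^2}, \qquad s^2 = E[x^2] - E[x]^2$$

This arose when packet arrivals, for instance, tended to form clusters, or when the service
function was approximately k parallel exponential stages of active servers. A typical switch
could exhibit such a distribution.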

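gamma:  $f(x) = \dfrac{\lambda (\lambda x)^{\alpha - 1} e^{-\lambda x}}{\Gamma(\alpha)}$,  x > 0, $\alpha$ > 0, $\lambda$ > 0
        = 0, otherwise,  where $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\,dx$,  t > 0
        $\bar{x} = \alpha/\lambda$,  $s^2 = \alpha/\lambda^2$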

The  results  of  the  AR  modeling  of  Section  2.1.1  were  effectively  used  to  compute  the  statistics 
(for  example,  mean  and  variance)  of  the  above  algorithm.  For  example,  equation  (3)  provides  an 
estimate  of  the  means  from  the  model. 
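
The distribution-selection algorithm of this section can be sketched as a single function. This is a minimal sketch under stated assumptions, not the implementation used in the effort: the function name match_distribution and the tolerance tol are hypothetical, the 1/k case is tested by rounding the ratio of squared mean to variance to the nearest integer (so that small sampling noise does not push the estimate down to k-1), and the hyperexponential case only reports the squared coefficient of variation because its construction is not given here.

```python
def match_distribution(samples, tol=0.05):
    """Choose a distribution family from the squared coefficient of variation.

    Returns (family_name, parameters). tol is an assumed tolerance for
    treating c^2 as equal to 1, or mean^2 / variance as equal to an integer k.
    """
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    if mean == 0 or var == 0:
        raise ValueError("degenerate sample: zero mean or zero variance")
    c2 = var / mean ** 2
    if abs(c2 - 1.0) <= tol:
        return "exponential", {"rate": 1.0 / mean}
    if c2 > 1.0:
        return "hyperexponential", {"c2": c2}
    inv = 1.0 / c2                      # equals mean^2 / variance
    k = round(inv)
    if k > 1 and abs(inv - k) <= tol:
        return "erlang", {"k": k, "rate": 1.0 / mean}
    # default: gamma, using mean = alpha / rate and variance = alpha / rate^2
    return "gamma", {"alpha": mean ** 2 / var, "rate": mean / var}
```

For example, the sample [1, 2, 3] has mean 2 and variance 1, so the ratio of squared mean to variance is 4 and the function returns an Erlang-4 family with rate 0.5.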


2.1.3  Queuing Model Research

For  queue  modeling,  we  looked  at  applicable  M/G/1  and  G/M/1  formulations,  and  more 
generally  at  (possibly  Markov-modulated)  G/G/1  regimes.  The  models  of  Section  2.1.1 
approximately  provided  G,  the  general  arrival  or  service  distribution.  The  main  issue  was,  of 
course,  the  (real-time)  computation  of  the  statistics  of  the  queue. 

For  M/G/1,  the  so-called  Pollaczek-Khinchin  mean-value  formula  provided  the  average  number 
of  users  in  the  system,  viz. 

$$r = \rho + \rho^2 \, \frac{1 + C^2}{2(1 - \rho)}$$

with C being the coefficient of variation of the service time (probability distribution), and
$\rho$ the load (average arrival rate divided by the average service rate). In particular, if G = D
(deterministic service time), then C = 0, so that

$$r = \frac{\rho\,(2 - \rho)}{2(1 - \rho)}$$
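
The Pollaczek-Khinchin mean-value formula, r = ρ + ρ²(1 + C²) / (2(1 − ρ)), is simple enough to evaluate at every poll. The sketch below is illustrative; the function name mg1_mean_in_system is hypothetical.

```python
def mg1_mean_in_system(rho, cv):
    """Pollaczek-Khinchin mean number of users in an M/G/1 system.

    rho is the load (0 <= rho < 1 for a stable queue); cv is the
    coefficient of variation of the service time.
    """
    if not 0 <= rho < 1:
        raise ValueError("load must be in [0, 1) for a stable queue")
    return rho + rho ** 2 * (1 + cv ** 2) / (2 * (1 - rho))
```

With cv = 1 (exponential service) the result reduces to ρ/(1 − ρ), the M/M/1 value; with cv = 0 it reduces to the deterministic-service expression ρ(2 − ρ)/(2(1 − ρ)).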

Taking  a  deterministic  service  time  might  be,  in  many  cases,  a  good  approximation,  as  it  was 
quite  difficult  to  evaluate  the  exact  service  time  distribution.  The  variance  v  and  higher-order 
statistics  of  the  number  of  users  in  the  system  could  be  obtained  by  so-called  transformation 
methods.  They  involved,  however,  higher-order  moments  of  the  service  time  distribution.  We 
were  looking  at  numerical  evaluations  of  these  higher-order  statistics. 

A  G/M/1  is  the  dual  of  M/G/1,  since  the  statistics  of  the  arrival  time  distribution  account  for  the 
waiting  time  statistics.  The  duality  took  us  again  to  Laplace  transforms.  In  particular,  the 
parameters  r  and  v  involved  the  Laplace  transform  of  the  arrival  time  probability  distribution. 
Indeed, if $\theta$ was the solution to the equation

$$\theta = A(\mu - \mu\theta)$$

($\theta$ is the probability of an arrival finding the queue full), where A(z) was the Laplace
transform of the arrival distribution G, and $\mu$ was the assumed Markovian service rate, then

$$r = \frac{\rho}{1 - \theta}$$

with variance

$$v = \frac{\rho\,(1 - \rho + \theta)}{(1 - \theta)^2}$$
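
A numerical treatment of the G/M/1 case can be sketched as a fixed-point iteration. The sketch assumes the standard G/M/1 results, namely that θ solves θ = A(μ − μθ) with A the Laplace transform of the interarrival-time density, that the mean number in the system is ρ/(1 − θ), and that its variance is ρ(1 − ρ + θ)/(1 − θ)². The function names and the starting value are hypothetical.

```python
def gm1_theta(arrival_lst, mu, tol=1e-12, max_iter=10000):
    """Solve theta = A(mu - mu * theta) by fixed-point iteration.

    arrival_lst is a function returning the Laplace transform of the
    interarrival-time density; mu is the Markovian service rate.
    """
    theta = 0.0
    for _ in range(max_iter):
        nxt = arrival_lst(mu * (1 - theta))
        if abs(nxt - theta) < tol:
            return nxt
        theta = nxt
    raise RuntimeError("fixed-point iteration did not converge")


def gm1_stats(rho, theta):
    """Mean and variance of the number of users in a G/M/1 system."""
    mean = rho / (1 - theta)
    var = rho * (1 - rho + theta) / (1 - theta) ** 2
    return mean, var
```

As a check, exponential interarrival times with rate 1 have transform 1/(1 + s); with μ = 2 the iteration converges to θ = 0.5, which equals the load, and the statistics reduce to the M/M/1 values.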

The main issue with the implementation of these formulas lay in the numerical estimation of
$\theta$, $\rho$ and $\mu$. We show next how to compute the service rate and effective buffer size
using an iterative algorithm.

One  of  the  problems  that  required  immediate  attention  was  ascertaining  a  server’s  service  rate 
from  posted  data  on  traffic.  Without  this,  a  queuing  analysis  would  be  impossible;  any 
interesting  property  vector  of  a  server  had  to  include  this  as  a  parameter.  Also  of  importance  was 
the  knowledge  of  the  available  buffer  volume  at  the  server.  If  such  a  datum  were  available,  it 
would  be  easier  to  control  and  plan  its  action.  Every  server  has  to  have  some  storage  for 
incoming  packet  traffic  desiring  service.  If  the  server  is  highly  utilized,  it  is  possible  that  its 
storage  capability  is  reduced  to  naught  in  which  case  it  would  begin  to  drop  incoming  packet 
traffic  as  it  came  to  the  server  for  service. 

Suppose a server node was configured with b MB of buffer storage. Even though it was known
that the queuing volume of a server could not exceed b MB, it was not easy to determine the
effective capacity of the buffer due to variations in packet size. Consider a queuing system
X/Y/1/k depicting a network node (a router, a host) offering a total of k packet buffers for
temporary storage of packets. As long as packets were not dropped, it would appear as


[Diagram: Packets -> Infinite Buffer]


However, if the same node began to drop packets owing to buffer unavailability (full buffer), it
would appear, all else being equal, as


[Diagram: Packets -> Finite Buffer.  Model: X/Y/1/k]


Therefore,  from  a  server’s  performance,  particularly  as  it  began  to  drop  packets  from  entering 
into  buffer,  we  would  have  to  infer  the  server’s  effective  storage  capacity.  Let  us  suppose  the 
storage  capacity  was  k  packets,  k  being  the  maximum  number  of  packets  the  buffer  can  hold  at 
one time. For convenience, we assumed an M/M/1/k system. The probability that the server’s
buffer was full, i.e., the packet dropping probability, was

$$p_k = p_{drop} = \frac{(1 - \rho)\,\rho^k}{1 - \rho^{k+1}} \qquad (4)$$
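
Equation (4), the M/M/1/k full-buffer probability (1 − ρ)ρ^k / (1 − ρ^(k+1)), can be evaluated directly. The sketch below is illustrative and the function name is hypothetical; the only subtlety is the removable singularity at ρ = 1, where the limit of the formula is 1/(k + 1).

```python
def mm1k_drop_probability(rho, k):
    """Probability that an M/M/1/k system is full, as in Equation (4)."""
    if rho < 0 or k < 1:
        raise ValueError("rho must be non-negative and k at least 1")
    if rho == 1:
        return 1.0 / (k + 1)          # limit of the formula as rho -> 1
    return (1 - rho) * rho ** k / (1 - rho ** (k + 1))
```

For instance, with ρ = 0.5 and a single buffer (k = 1) the drop probability is 1/3.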


where the traffic intensity at the buffer before it became full was $\rho = \lambda/\mu$. Let us assume
that the packet arrival rate $\lambda(t)$ was known but $\mu(t)$, the rate at which the server transmitted
packets onto the network, was not known and had to be inferred (the general case). Suppose at two
consecutive sampling instances, the server rates were, on average, $\mu(t)$ and $\mu(t + \delta t)$.
Accordingly, Equation (4) could be expressed as

$$(1 - p_k)\,\rho_t^{k+1} = \rho_t^k - p_k \qquad (5a)$$

$$(1 - p_k)\,\rho_{t+\delta t}^{k+1} = \rho_{t+\delta t}^k - p_k \qquad (5b)$$


Therefore,

$$\frac{\rho_{t+\delta t}^{k+1}}{\rho_t^{k+1}} = \frac{\rho_{t+\delta t}^k - p_k}{\rho_t^k - p_k} \qquad (6a)$$


In this, we were making the assumption that the probability that a packet coming for service
would find the buffer full remains, to a first-order approximation, more or less the same.
Equation (6a) could also be expressed as

$$\frac{(\rho + \delta\rho)^{k+1}}{\rho^{k+1}} = \frac{(\rho + \delta\rho)^k - p_k}{\rho^k - p_k}$$

This could be further simplified as

$$\frac{(k + 1)\,\delta\rho}{\rho} = \frac{k\,\rho^{k-1}\,\delta\rho}{\rho^k - p_k}$$

which in turn, in the first-order approximation, leads to

$$\rho^k = (k + 1)\,p_k \qquad (6b)$$

In terms of this, Equation (4) can be re-expressed as

$$1 - \rho^{k+1} = (1 - \rho)(k + 1)$$

from which we get a recursive computation

$$\rho = 1 - \frac{1 - \rho^{k+1}}{k + 1} \qquad (6c)$$

whence the derived service rate $\mu(t)$ is obtained, when (6c) converges, as

$$\mu(t) = J / \rho \qquad (7)$$

where J is an epoch average arrival rate.
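
The idea behind Equation (7), inferring an equivalent service rate from an observed drop probability and a known arrival rate, can also be sketched with a different numerical technique. The sketch below does not implement the recursion (6c); instead, as an assumption made for illustration, it inverts Equation (4) for ρ by bisection (the full-buffer probability increases with ρ) and then applies μ = J/ρ. The function names, the bracketing bound hi, and the iteration count are hypothetical.

```python
def drop_probability(rho, k):
    """Equation (4): probability that an M/M/1/k buffer is full."""
    if rho == 1:
        return 1.0 / (k + 1)
    return (1 - rho) * rho ** k / (1 - rho ** (k + 1))


def ems_rate(arrival_rate, p_drop, k, hi=100.0, iters=200):
    """Find rho with drop_probability(rho, k) == p_drop, then mu = J / rho.

    Returns (mu, rho). Bisection is used here in place of recursion (6c).
    """
    lo = 0.0
    if not 0 < p_drop < drop_probability(hi, k):
        raise ValueError("p_drop outside the bracketed range")
    for _ in range(iters):
        mid = (lo + hi) / 2
        if drop_probability(mid, k) < p_drop:
            lo = mid
        else:
            hi = mid
    rho = (lo + hi) / 2
    return arrival_rate / rho, rho
```

As a round-trip check, a drop probability generated from ρ = 0.8 and k = 10, combined with an arrival rate of 40 packets per unit time, yields ρ = 0.8 and a service rate of 50.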

This we defined as the Equivalent Markovian Server (EMS) rate of our server where the
effective service distribution may not be exponential. But, in order to use (7) we needed to know
the maximum number of buffers with which a server was endowed, i.e., the value of k that does
not explicitly involve the variable $\mu$. Note that in this model, the functional size k always
obeyed the recursive expression

$$\frac{1}{k}\log\{p_k (1 + k)\} = \log\left(1 - \frac{1}{1 + k}\right) + \log\left(1 + \frac{[(1 + k)\,p_k]^{(k+1)/k}}{k}\right)$$

from which we obtained

$$k = \frac{\log(1 + k) + \log p_k}{\log\left(1 + \frac{[(1 + k)\,p_k]^{(k+1)/k}}{k}\right) - \mathrm{sum}(k)}, \qquad \mathrm{sum}(k) = \frac{1}{1 + k} + \frac{1}{2(1 + k)^2} + \frac{1}{3(1 + k)^3} + \cdots \qquad (8)$$


Using  an  average  of  k  as  obtained  in  (8)  over  the  sampling  distribution  we  could  arrive  at  an 
effective  estimate  of  k.  The  algorithm  to  be  followed  could  be  cast  as 

□  Start with a good estimate (initial value) of k. A typical approach would be

$$k_{init} = f(\text{queue\_length} \mid p_k \approx 0)$$

□  Use (8) to get an estimate of k when it converges

□  Use the average of all estimates over some sample runs

The significance of Equations (7) and (8) could be appreciated as follows. If the true
performance statistics of a server were not known a priori (for instance, SNMP does not offer
any clue), then a maximum entropy distribution conjectured at a level of maximum ignorance
would always lead to an exponential distribution for the server performance. In this sense, an
effective EMS measure projecting a server’s service rate was always a good choice.

In  deriving  appropriate  queuing  models  for  network  entities,  such  as  routers,  switches,  etc.,  one 
had  to  obtain  a  reasonable  estimate  of  the  server’s  service  rate.  For  every  Simple  Network 
Management  Protocol  (SNMP)  data  poll,  the  current  utilization  calculation  needed  to  be  based 
on  some  maximum  throughput  in  order  to  arrive  at  a  percentage.  This  allowed  us  to  gauge  how 
heavily  any  given  node  was  utilized. 

Since SNMP-collected data on these parameters did not exist, we had to take a more relative
approach. For each node, a maximum throughput was learned through long-term polling and
service  rate  calculations.  If  we  discovered  a  service  rate  that  was  higher  than  what  we  thought, 
the  current  value  simply  replaced  the  old.  After  a  reasonable  amount  of  time  (e.g.,  24  hours),  we 
could  state  that  our  maximum  service  rates  were  reasonably  accurate.  Therefore,  each  poll’s 
service  rate  was  compared  to  the  maximum  service  rate,  and  we  arrived  at  our  percentage.  Each 
service  rate  was  relative  to  that  node’s  past  performance. 

Although  this  approach  did  not  yield  a  definitive  maximum  service  rate  for  a  node,  it  did  provide 
the  network  manager  with  perhaps  a  more  useful  statistic:  the  traffic  level  on  any  given  node 
relative  to  its  standard  traffic  level. 
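
The relative utilization scheme described above amounts to keeping a running maximum per node. The following is a minimal sketch; the class name RelativeUtilization and its interface are hypothetical, and the learning period (for example, 24 hours) is left to the caller.

```python
class RelativeUtilization:
    """Track the highest service rate seen per node and report each poll
    as a percentage of that running maximum."""

    def __init__(self):
        self.max_rate = {}

    def update(self, node, service_rate):
        # a higher observed rate simply replaces the stored maximum
        best = max(self.max_rate.get(node, 0.0), service_rate)
        self.max_rate[node] = best
        return 100.0 * service_rate / best if best > 0 else 0.0
```

Early polls report 100% until a higher rate has been seen, which is why the percentages only become meaningful after the learning period.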


2.1.4  Aggregate Model Research


A natural aggregate model for any subnetwork (or network) of n nodes $N_1, N_2, \ldots, N_n$ (e.g., n
routers), extending the node model of Section 2.1.1, is a multivariate AR model of dimension
m = n q, where q is the number of variables of interest at each node (e.g., q SNMP variables). The
simplest example of such a model corresponded to the case q = 1. For example, we might be
interested in the overall packet load distribution in a network of n routers, and choose, as the
single subdimension at each node (router), the sum of two SNMP variables:

ifPkts = ifInUcastPkts + ifInNUcastPkts

Note that this modeling approach applies to local area networks as well. In any case, the
machinery of Section 2.1.1 carries over to networks and subnetworks. In theory, it is not even
required that the subdimension be the same for all nodes. We might, for example, be interested
only in a few variables in the “interior” of our network, but require more information (thus
higher subdimension) at the “periphery”. Hence, in general, the total dimension of the
multivariate AR model is

$$m = \sum_{i=1}^{n} q_i$$

where $q_i$ is the subdimension at node $N_i$.

The  proposed  model  scaled  naturally.  New  nodes  simply  contributed  additional  dimensions  to 
the  model.  The  main  issue  remained  the  identification  of  model  change  points.  Our  approach 
was  to  perform  CUSUM  filtering,  as  explained  in  Section  2.2,  at  each  node  of  the  aggregate 
model,  then  take  the  union  of  the  so  identified  change  points,  eliminating  points  of  low 
significance.  We  noted  that  physical  characteristics,  such  as  clock  discrepancy  and  wire  distance 
were,  in  some  sense,  already  built  into  the  vector  AR  model.  Of  course,  the  model  computation 
complexity  scaled  up  with  the  size  of  the  network.  We  collected  and  analyzed  (as  in  Section 
2.1.1)  traffic  data  from  multiple  routers  and  switches. 

We used Traceroute and Ping to build a network topology model. The network itself was seen as
connectivities among nodes and switches, along with some kind of performance measure on
the connecting links, on the individual participating nodes, or on both. This provided a
logical architecture of the network. Therefore, if we had n nodes in a communicating network,
the overall systemic topology $\Im(net)$ was an aggregate collection of individual topologies
rationalized at a meta level where the entire network topology $\Im(net)$ was visualized. Therefore,
the network topology was defined as

$$\Im(net) = \bigcup\nolimits_c \, \Im(i \mid i \in nodes) \qquad (9)$$

where $\bigcup_c$ was a union operation over all the individual network topologies
$\Im(i \mid i \in nodes)$ seen from distinct active nodes $i \in nodes$, with the subscript c referring
to a list update operation such that the overall view $\Im(i \mid i \in nodes)$ was internally
consistent within the topology list.

Internally, from a specific node k, the topology was the set $\Im(k)$ comprising minimally the
following:

□  From the node k, the direct links to all neighboring nodes were seen, along with
the transmission delay and its associated variance, respectively, as obtained in
Algorithm 1 and Algorithm 2 below using Ping/Traceroute routines.

□  The congestion levels of the neighboring nodes of k, through Algorithms 3 and 4,
depicted the node-level congestion, regional congestion (within 1-hop distance), and
district-level congestion (within h-hop distance) in either an optimistic or a
pessimistic frame of reference.

The set $\Im(k)$ was finally realized as a vector in Algorithm 5. For our purpose, we allowed the
user (also known as a controller/manager) to define the dimension of the district either by
selecting a region around a node, or by setting the maximum hop length to a node within the
district. Also, in all algorithms below, the one central measure was the transmission time or
sojourn time of a probing packet from a node k to a node l, introduced as a property of the link
(kl) connecting the two nodes, viz. $\{link(kl), \tau(kl), l \in nodes\}$, where $\tau(kl)$ was a property
vector associated with link(kl). For the time being, we assumed it to be a scalar indicating an
ICMP probe-packet transmission time from node k to node l, t(kl). Since over all of these
sampling events this measure could yield a minimum of three estimates of $\tau(kl)$, namely the best
time, the worst time, and the average time for any pair of nodes k and l, we could have three
topology estimates of the network from any specific node. Let these topologies from a node k be
$\Im_b(k)$, $\Im_w(k)$, and $\Im_a(k)$, respectively.

2.1.4.1  Algorithm 1

From manager node k, the agent identified the list of all neighbors of k one hop away. Let the
list comprise the addresses $\langle IP_1^k, IP_2^k, \ldots, IP_j^k \rangle$. This list was updated every T
seconds. Each node $IP_i^k$ in the above list sent its own neighborhood list
$\langle IP_1^i, IP_2^i, \ldots, IP_j^i \rangle$ to the manager node k every T seconds.

The manager node k, using all such lists, obtained the global neighborhood list as a union over
all these individual neighborhood lists. This list was always self-consistent. The global list was
maintained as a collection of adjacency lists implemented as arrays.

For each item $IP_i$ in the global list within a distance of h hops from the manager node k, its
neighborhood was obtained from the next round of collection and appended to the existing global
list. After kT seconds, step 1 was repeated to update its list.

This algorithm would be used to produce the best time, worst time, and average time
topology for the network evolving from the manager node in the network. These time estimates
could be obtained either with Traceroute or Ping.
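
The list-merging part of Algorithm 1 can be sketched as follows. This is an illustration under stated assumptions rather than the agent code itself: the function names are hypothetical, node addresses are plain strings, links are treated as bidirectional, and the periodic collection every T seconds is outside the sketch.

```python
from collections import deque


def merge_neighbor_lists(reports):
    """Union per-node neighbor reports into one symmetric adjacency map.

    reports maps a node address to the list of neighbors it reported.
    """
    adjacency = {}
    for node, neighbors in reports.items():
        adjacency.setdefault(node, set())
        for other in neighbors:
            adjacency[node].add(other)
            adjacency.setdefault(other, set()).add(node)
    return adjacency


def within_hops(adjacency, manager, h):
    """Breadth-first search: hop distance of every node at most h hops
    from the manager node."""
    dist = {manager: 0}
    queue = deque([manager])
    while queue:
        node = queue.popleft()
        if dist[node] == h:
            continue                   # do not expand beyond h hops
        for other in adjacency.get(node, ()):
            if other not in dist:
                dist[other] = dist[node] + 1
                queue.append(other)
    return dist
```

Delay measurements obtained with Ping or Traceroute could then be attached to each adjacency entry to give the best, worst, and average time topologies.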


2.1.4.2  Algorithm 2

This algorithm would be basically the same as Algorithm 1 except that now we would measure the
variance of the random variate t(kl), provided node l was within h hops of the manager node
k. However, in this case we would have only a single variance topology map. Either Ping or
Traceroute could be used to obtain delay samples between the probing node and the probed node.
To avoid outliers, we discarded the best 10% and the worst 10% of the delay
estimates and computed the variance on the remaining samples.
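
The outlier handling of Algorithm 2 is a trimmed variance. A minimal sketch follows; the function name trimmed_variance is hypothetical, and the number of samples dropped at each end is taken as the integer part of 10% of the sample count, which is an assumption about rounding.

```python
def trimmed_variance(delays, trim=0.10):
    """Sample variance after dropping the lowest and highest trim fraction."""
    ordered = sorted(delays)
    cut = int(len(ordered) * trim)     # samples removed at each end
    kept = ordered[cut:len(ordered) - cut]
    if len(kept) < 2:
        raise ValueError("not enough samples after trimming")
    mean = sum(kept) / len(kept)
    return sum((d - mean) ** 2 for d in kept) / (len(kept) - 1)
```

With ten delay samples, one sample is removed from each end, so a single very large round-trip time does not inflate the variance estimate.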


2.1.4.3  Algorithm 3


Using Ping and/or Traceroute, the number of packets dropped at each node with a packet size set
at p was obtained. The default packet size was 64 bytes. However, since we were interested in
realistic cases, the packet size could be set at the average packet size the IP layer usually handles
at the probing node. Using this option, a packet-drop map at the nodes of the topology list was
obtained. Let $drop_l(t)$ be the count of packets dropped at a node l up to
time t since the last count at an earlier time. In terms of
the number of packets actually served, $serv_l(t)$, at the node l, the probability of an arriving
packet being dropped at node l could be obtained as

$$prob_{drop}(l \mid t) = \frac{drop_l(t)}{serv_l(t) + drop_l(t)} \qquad (10)$$

This was a first-level estimate of the dropping probability at a specific node l. If the system
were in equilibrium within an epoch e, all such probability estimates at a specific node would
form the sampling distribution of the population mean within the epoch. Accordingly, this
algorithm yielded for a node l the population mean

$$p(l_{drop}) = \frac{1}{|e|} \sum_{t \in e} prob_{drop}(l \mid t) \qquad (11)$$

where |e| is the number of sampling instants in the epoch.

This algorithm, using the measure (11) as an estimate of drop probability, could be used to obtain the regional congestion level at and around a specific node $l$. The congestion map could be defined in many ways. One likely definition suggested at this stage is given in the next algorithm.
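As a rough illustration of equations (10) and (11), the following Python sketch computes a per-interval drop probability and then averages the estimates over one epoch. The discrete average is an assumed stand-in for the integral in (11), and the function names and packet counts are hypothetical.

```python
def drop_probability(dropped, served):
    """Equation (10): fraction of arriving packets dropped in one counting interval."""
    total = served + dropped
    if total == 0:
        return 0.0  # assumption: with no arrivals, report no observed drops
    return dropped / total

def epoch_drop_estimate(samples):
    """Average of the per-interval estimates over an epoch (discrete stand-in for (11))."""
    estimates = [drop_probability(d, s) for d, s in samples]
    return sum(estimates) / len(estimates)

# Hypothetical (dropped, served) counts for one node at four sampling times in an epoch.
epoch_samples = [(1, 99), (3, 97), (0, 100), (4, 96)]
print(epoch_drop_estimate(epoch_samples))
```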

2.1.4.4 Algorithm 4

On a topology G = (V, E) the congestion map was conjectured as follows. At a node $l$, the node congestion is

$$\mathrm{cong}(l \mid \mathrm{node}) = p(l_{drop}) \qquad (12)$$

We defined the link congestion as the ratio of the actual throughput realized on a link (i.e., the number of bytes moving through the link per unit time) to its available bandwidth B in bytes/sec. This information could be obtained from SNMP-procured data. Accordingly, let us define

$$\mathrm{cong}(\mathrm{link}(kl)) = \lambda_{kl}(t) / B \qquad (13)$$

where $\lambda_{kl}(t)$ is the throughput on link kl at time t.

The regional congestion at a node k is then defined as follows:

Regional Congestion (optimistic)

$$\mathrm{cong}(k \mid \mathrm{regional}) = 1 - (1 - \mathrm{cong}(k \mid \mathrm{node})) \times (1 - \min_l \mathrm{cong}(\mathrm{link}(kl))) \qquad (14)$$

Regional Congestion (pessimistic)

$$\mathrm{cong}(k \mid \mathrm{regional}) = 1 - (1 - \mathrm{cong}(k \mid \mathrm{node})) \times (1 - \max_l \mathrm{cong}(\mathrm{link}(kl))) \qquad (15)$$

Of course, such measures would be used with appropriate correcting factors to align them properly with the given initial and boundary conditions.
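A small Python sketch of the link and regional congestion measures in (13) through (15) follows. The function names and the numeric inputs are hypothetical, and the correcting factors mentioned above are not modeled.

```python
def link_congestion(throughput_bytes_per_s, bandwidth_bytes_per_s):
    """Equation (13): throughput on a link divided by its available bandwidth."""
    return throughput_bytes_per_s / bandwidth_bytes_per_s

def regional_congestion(node_cong, link_congs, pessimistic=False):
    """Equations (14)/(15): combine node congestion with the least or most congested link."""
    link_term = max(link_congs) if pessimistic else min(link_congs)
    return 1.0 - (1.0 - node_cong) * (1.0 - link_term)

# Hypothetical node with drop-based congestion 0.02 and two attached links.
node_k = 0.02
links = [link_congestion(2.0e5, 1.0e6), link_congestion(6.0e5, 1.0e6)]
print(regional_congestion(node_k, links))                    # optimistic, uses the min link
print(regional_congestion(node_k, links, pessimistic=True))  # pessimistic, uses the max link
```

The pessimistic value is never smaller than the optimistic one, since it substitutes the most congested link for the least congested one.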

The regional congestion map could be further aggregated to produce a district congestion map at and around a given node. Either the controller/manager selects the region around the node in question by marking it appropriately or, by default, the district of a node k consists of the nodes within two hops of k. Therefore,

□ default setting: node $l$ is within the district of k if $\mathrm{hop\_dist}(kl) \le 2$

□ selected specification: node(l) is in district(node(k)) if $\mathrm{node}(l) \in \mathrm{marked\_set}(\mathrm{node}(k))$

□ user-specified district diameter: in the setting box, the user specifies the diameter of the district. If the diameter is p, then node $l$ is within the district of k when $\mathrm{hop\_dist}(kl) \le p$. The default setting is a special case of this with p = 2.

The  district  congestion  map  would  be  either  optimistic  or  pessimistic.  The  map  would  be 
constructed  accordingly. 

Mark the entire district around node(k) with a color or a number (between 1 and 10) directly proportional to the computed value as follows.

For an optimistic district congestion map around a node(k):

$$\mathrm{cong}(k \mid \mathrm{district}) = 1 - (1 - \min_l \mathrm{cong}(l \mid \mathrm{regional})) \times (1 - \mathrm{avg}_l\, \mathrm{cong}(l \mid \mathrm{regional})) \qquad (16)$$

For a pessimistic district congestion map around a node(k):

$$\mathrm{cong}(k \mid \mathrm{district}) = 1 - (1 - \max_l \mathrm{cong}(l \mid \mathrm{regional})) \times (1 - \mathrm{avg}_l\, \mathrm{cong}(l \mid \mathrm{regional})) \qquad (17)$$
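The district construction can be sketched in Python as a breadth-first search over hop distance followed by the aggregation step. This is only a sketch under stated assumptions: the adjacency-list representation, the inclusion of node k in its own district, the inclusive hop bound, and the "1 minus product" form used for both the optimistic and pessimistic maps are all assumptions, as are the function names and sample values.

```python
from collections import deque

def district(adjacency, k, diameter=2):
    """Nodes within `diameter` hops of k (breadth-first search), including k itself."""
    hops = {k: 0}
    queue = deque([k])
    while queue:
        node = queue.popleft()
        if hops[node] == diameter:
            continue  # do not expand beyond the district boundary
        for neighbor in adjacency.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return set(hops)

def district_congestion(regional, members, pessimistic=False):
    """Aggregate regional congestion values over a district (assumed form of (16)/(17))."""
    values = [regional[n] for n in members]
    extreme = max(values) if pessimistic else min(values)
    average = sum(values) / len(values)
    return 1.0 - (1.0 - extreme) * (1.0 - average)

# Hypothetical chain topology k - a - b - c with made-up regional congestion values.
adjacency = {"k": ["a"], "a": ["k", "b"], "b": ["a", "c"], "c": ["b"]}
regional = {"k": 0.2, "a": 0.4, "b": 0.6, "c": 0.9}
members = district(adjacency, "k")
print(sorted(members))
print(district_congestion(regional, members))
print(district_congestion(regional, members, pessimistic=True))
```

Node c lies three hops from k, so it is excluded under the default two-hop setting.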


2.1.4.5 Algorithm 5

This algorithm would collect the results of all the above computations and assemble the property vector of every node $l$ visible from the manager node k. This list would be updated periodically, at a rate dictated by the sampling frequency of the traffic profiles around a node.


2.2  TASK  2  -  REAL-TIME  MODEL  SELECTION 
2.2.1  Real  Time  Estimation  of  Model  Parameters 

The estimation of the model parameters of Section 2.1.1 is most accurate when all data for an epoch are available. In the real-time monitoring context, we may, of course, use only the data collected so far in the current epoch to estimate the epoch’s model. We conducted experiments to ascertain the effects of the data run length on the order p and the other parameters $A_i$, W, C of the models, in order to arrive at usable heuristics. We report in the following tables only the results for the $E_1$ epoch of Section 2.1.1, but our findings for most other epochs show essentially the same patterns. Table 5 shows the models computed by our algorithm when only the first 20 data points of the epoch were used. The rows differ in the maximum order supplied, which is a parameter of the algorithm.

Table 5. Parameters for $E_1$ Using the First 20 Data Points

max order p    actual p    coefficient matrix A_i          intercept vector W    noise covariance C
proposed       computed    computed                        computed              computed
-----------    --------    ----------------------------    ------------------    ------------------
3              1           [0.6836]                        10^6 * [3.0647]       10^1? * [2.8155]
4              1           [0.7100]                        10^6 * [2.7729]       10^1? * [2.9568]
5              3           [0.7507] [-0.3034] [0.3259]     10^6 * [2.3690]       10^1? * [3.1143]
6              3           [0.7292] [-0.2746] [0.3015]     10^6 * [2.3690]       10^1? * [3.3997]
7              5                                           10^6 * [4.4684]       10^1? * [3.1387]
8              8                                           10^6 * [-1.7930]      10^1? * [4.9017]

The maximum order p proposed to our algorithm cannot be too high compared to the data length r. In fact, when it approached r/2, the model was over-fitted to the observed data, as can be seen from the high value of the actual computed order. The parameters $A_i$ and W found in this case were consistent with the values shown in Table 2, which implies that the noise covariance would be inconsistent. Intuitively, the algorithm had to decide how much of the relatively meager data it saw should be fitted and how much treated as noise. Tables 6 and 7, on the other hand, show that the data seen may be accounted for more as random fluctuations than as trends.

Table 6. Parameters for $E_1$ Using the First 30 Data Points

max order p    actual p    coefficient matrix A_i          intercept vector W    noise covariance C
proposed       computed    computed                        computed              computed
-----------    --------    ----------------------------    ------------------    ------------------
3              1           [0.3097]                        10^6 * [6.0854]       10^1? * [7.3223]
4              1           [0.3137]                        10^6 * [6.0365]       10^1? * [7.6227]
5              1           [0.3096]                        10^6 * [6.1309]       10^1? * [7.8598]
6              1           [0.2956]                        10^6 * [6.3179]       10^1? * [8.2502]
7              0           none                            10^6 * [9.1548]       10^1? * [3.1387]

Table 7. Parameters for $E_1$ Using the First 40 Data Points

max order p    actual p    coefficient matrix A_i          intercept vector W    noise covariance C
proposed       computed    computed                        computed              computed
-----------    --------    ----------------------------    ------------------    ------------------
3              1           [0.3165]                        10^6 * [5.9284]       10^1? * [6.3277]
4              1           [0.3200]                        10^6 * [5.8861]       10^1? * [6.5080]
5              1           [0.3192]                        10^6 * [5.9316]       10^1? * [6.6487]
6              1           [0.3117]                        10^6 * [6.0341]       10^1? * [6.8025]
7              1           [0.2916]                        10^6 * [6.2931]       10^1? * [6.7655]
8              1           [0.2926]                        10^6 * [6.2826]       10^1? * [6.9909]
19             0           none                            10^6 * [8.3674]       10^1? * [8.3501]

Here, when the maximum order p proposed approached r/2, the observed data were all treated as noise, as can be seen from the zero order computed. The parameters $A_i$ and W then had values somewhat inconsistent with those of Table 2. This was not as odd as it appeared, since the high value of the intercept vector accounted for the low value of the coefficient matrix. The heuristic derived from our experiments could be stated as follows.


□ Within a new epoch, if no change point has been signaled (see Sections 2.1.2 and 2.1.3), collect 20 samples to compute the current model; supply a low maximum order (3 or 4) to the algorithm; if so desired, recompute the model every S samples (e.g., S = 10 or 20).


□ If a change point is signaled, start over with the approach stated above.
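The heuristic can be illustrated with a minimal, univariate Python sketch that fits autoregressive models of order 0 through a supplied maximum by least squares and keeps the order with the lowest score. The BIC-style score, the pure-Python solver, and all function names are assumptions for illustration; the report's own estimation algorithm may select the order differently, and the simulated series is hypothetical.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ar(x, p, pmax):
    """Least-squares AR(p) with intercept, fitted on samples pmax .. len(x)-1."""
    rows = [[1.0] + [x[t - j] for j in range(1, p + 1)] for t in range(pmax, len(x))]
    y = x[pmax:]
    k = p + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)  # [intercept, a_1, ..., a_p]
    resid = [yi - sum(c * v for c, v in zip(coef, r)) for r, yi in zip(rows, y)]
    return coef, sum(e * e for e in resid) / len(y)

def select_order(x, pmax):
    """Return (order, coefficients, noise variance) minimizing a BIC-style score."""
    n = len(x) - pmax
    best = None
    for p in range(pmax + 1):
        coef, var = fit_ar(x, p, pmax)
        score = n * math.log(var) + (p + 1) * math.log(n)
        if best is None or score < best[0]:
            best = (score, p, coef, var)
    return best[1], best[2], best[3]

# Hypothetical AR(1) traffic-like series; only the first 20 samples are used.
random.seed(1)
series = [9.0]
for _ in range(39):
    series.append(6.0 + 0.3 * series[-1] + random.gauss(0.0, 1.0))
order, coef, noise_var = select_order(series[:20], pmax=3)
print(order, coef, noise_var)
```

Keeping the maximum order small relative to the run length, as the heuristic recommends, leaves enough residual degrees of freedom for the noise variance estimate to be meaningful.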
2.2.2 Model Comparison Research


A decision as to the adequacy of a model $M_0$, based on the current value of some numerical quantity Q, usually referred to as the monitor, chooses among three alternatives:


$A_a$  acceptable model performance; reinitialize monitor as appropriate
$A_f$  further observation needed; update monitor accordingly
$A_s$  signal model performance inadequacy.


Initially, we wanted to concentrate on monitors utilizing so-called Bayes’ factors, sometimes called weights of evidence, which may be defined as follows. Suppose we must compare two models $M_0$ and $M_1$. At time t, suppose we have accumulated some historical information $D_{t-1}$, if any, that may help us decide. At time t, suppose we have an observed quantity $x_t$ (a random variable of interest, such as the SNMP ifInOctets variable). The Bayes’ factor for $M_0$ versus $M_1$, at time t, is the ratio

$$H_t = H_t(1) = \frac{p(x_t \mid M_0, D_{t-1})}{p(x_t \mid M_1, D_{t-1})}$$

That is, the odds that $x_t$ comes from $M_0$ against that it comes from $M_1$, given $D_{t-1}$. More generally, if k consecutive observed values

$$x_{t-k+1}, x_{t-k+2}, \ldots, x_{t-1}, x_t$$

are available, then the Bayes’ factor for $M_0$ versus $M_1$, at time t, is the product $H_t(k)$ of the quantities $H_m(1) = H_m$ above, for m = t-k+1, ..., t. Clearly,

$$H_t(k) = H_t(1)\, H_{t-1}(k-1)$$

for t > 1, i.e., evidence for $M_0$ versus $M_1$ accumulates in multiplicative fashion as new observations come in (or in additive fashion in log space). Let

$$h(k) = h(0)\, H_t(k)$$

where h(0) is the initial condition, viz. the a priori odds ratio for $M_0$ versus $M_1$. h(k) may be used as a monitor by prescribing thresholds $\alpha$ and $\beta$, with $\alpha > \beta > 0$, and by choosing

$A_s$ if $h(k) > \alpha$

$A_a$ if $h(k) < \beta$

$A_f$ otherwise
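A Bayes' factor monitor of this kind can be sketched in Python by accumulating the log odds one observation at a time. The sketch follows the decision rule exactly as written above; the two unit-variance normal models, the thresholds, and the function names are hypothetical choices, and the historical information is assumed to enter only through the fixed model densities.

```python
import math

def normal_pdf(x, mean, sd=1.0):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def bayes_factor_monitor(xs, pdf0, pdf1, alpha, beta, prior_odds=1.0):
    """Return (decision, index) at the first observation where h(k) leaves [beta, alpha]."""
    log_h = math.log(prior_odds)  # log h(0)
    for i, x in enumerate(xs):
        log_h += math.log(pdf0(x)) - math.log(pdf1(x))  # multiply h by H_t(1)
        if log_h > math.log(alpha):
            return "As", i
        if log_h < math.log(beta):
            return "Aa", i
    return "Af", len(xs) - 1  # still undecided after the last observation

# Hypothetical models: M0 is N(1, 1) and M1 is N(0, 1).
def pdf_m0(x):
    return normal_pdf(x, 1.0)

def pdf_m1(x):
    return normal_pdf(x, 0.0)

print(bayes_factor_monitor([1.0] * 6, pdf_m0, pdf_m1, alpha=10.0, beta=0.1))
```

Working in log space avoids underflow when many small density values are multiplied together.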

Experiments implemented in Matlab, and the real-time computation requirements for monitoring a large number of models (corresponding to nodes and links in a large computer network), led us to reevaluate the use of Bayes’ factors as monitors for computer network models. In statistical process control, a common quantity used as a monitor is the so-called CUSUM, or cumulative sum. We expand on it in Section 2.2.3, but in this section we show that its use in our network monitoring investigations may be justified in both practical and theoretical terms.

With $A_a$, $A_f$, and $A_s$ defined above, let $x_1, x_2, \ldots, x_r$ be r observed values of the variable of interest since an $A_a$ or $A_s$ decision was made. In other words, the r observed values correspond to a run of r $A_f$ decisions. The CUSUM technique uses a pair of monitors referred to as upper and lower CUSUMs. Abstractly, an upper CUSUM decision scheme, denoted

$$\mathrm{CUSUM}^+(\Delta, \beta, \alpha)$$

uses the quantity

$$C_k = \sum_{i=1}^{k} (x_i - \Delta) = C_{k-1} + x_k - \Delta$$

as monitor, and chooses

$A_s$ if $C_k > \alpha$

$A_a$ if $C_k < \beta$

$A_f$ otherwise

For the moment, think of $\Delta$ as a target value for a quantity being monitored, and of $\alpha$ and $\beta$, with $\alpha > \beta > 0$, as prescribed thresholds. A lower CUSUM scheme, denoted $\mathrm{CUSUM}^-(\Delta, \beta, \alpha)$, is defined in a similar fashion, with the monitor satisfying

$$C_k = C_{k-1} + \Delta - x_k$$
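The upper and lower schemes can be sketched in a few lines of Python. Restarting the sum at zero after an A_s or A_a decision is an assumption made here to model "reinitialize monitor"; the function name, target, thresholds, and data are hypothetical.

```python
def cusum_decisions(xs, target, beta, alpha, upper=True):
    """Run a one-sided CUSUM and return the list of (C_k, decision) pairs."""
    c = 0.0
    out = []
    for x in xs:
        c += (x - target) if upper else (target - x)  # upper or lower recursion
        if c > alpha:
            decision = "As"
        elif c < beta:
            decision = "Aa"
        else:
            decision = "Af"
        out.append((c, decision))
        if decision != "Af":
            c = 0.0  # assumption: restart the run after an As or Aa decision
    return out

# Hypothetical observations monitored against a target of 10.
observations = [10, 11, 13, 14, 9]
for c_k, decision in cusum_decisions(observations, target=10, beta=0.5, alpha=5):
    print(c_k, decision)
```

Each update costs one addition and two comparisons, which is what makes the scheme attractive when many nodes and links must be monitored at once.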

On the surface, the above monitors appear unrelated. Consider, however, the exponential family of distributions, which includes the normal, gamma, and Weibull distributions. A member of this family has a natural scalar parameter $\eta$, and may be described by a probability distribution of the form

$$p(x, \eta) = \varphi(x) \exp(\eta x - a(\eta))$$

for some convex function $a(\eta)$ and factor function $\varphi(x)$. Two exponential models that are identical except for the value of the natural parameter $\eta$ may be compared by means of the Bayes’ factor monitor defined above. Here, using r observations, we have


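$$\log h(r) = \log h(0) + \log H_r(r)$$

$$\log H_r(r) = (\eta_0 - \eta_1) \sum_{i=1}^{r} \left( x_i - \frac{a(\eta_0) - a(\eta_1)}{\eta_0 - \eta_1} \right)$$

The decision scheme described by this monitor (model defined by $\eta_0$ versus model defined by $\eta_1$), using thresholds $\alpha > \beta > 0$, is then equivalent, for $\eta_0 > \eta_1$, to $\mathrm{CUSUM}^+(\Delta, \beta', \alpha')$, where

$$\Delta = \frac{a(\eta_0) - a(\eta_1)}{\eta_0 - \eta_1}$$

$$\beta' = \frac{\log \beta - \log h(0)}{\eta_0 - \eta_1}$$

$$\alpha' = \frac{\log \alpha - \log h(0)}{\eta_0 - \eta_1}$$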


For example, comparing two normal distributions with unit variances and means $\mu_0$ and $\mu_1$ with a Bayes’ factor monitor of thresholds $\alpha > \beta > 0$ is equivalent to comparing them using a $\mathrm{CUSUM}^+(\Delta, \beta', \alpha')$, with

$$\beta' = 0, \qquad \alpha' = \frac{\log \alpha - \log h(0)}{\mu_0 - \mu_1}, \qquad \text{and} \qquad \Delta = \frac{\mu_0 + \mu_1}{2}$$

For general distributions, we do not expect CUSUM and Bayes’ factor decision schemes to be equivalent. However, given the simplicity and efficiency of CUSUMs, and given the theoretical equivalence for the exponential family, the use of CUSUMs is justified.
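The equivalence for the unit-variance normal case can be checked numerically. The Python sketch below accumulates the log Bayes' factor and the upper CUSUM side by side and reports the largest discrepancy between log h(r) and log h(0) + (mu0 - mu1) C_r; the means, prior odds, data, and function names are hypothetical.

```python
import math

def log_normal_pdf(x, mean):
    """Log density of a unit-variance normal distribution."""
    return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2.0 * math.pi)

def max_gap(xs, mu0, mu1, h0=1.0):
    """Largest |log h(r) - (log h(0) + (mu0 - mu1) * C_r)| over the run."""
    delta = (mu0 + mu1) / 2.0  # CUSUM target value for the two normal models
    log_h, cusum, gap = math.log(h0), 0.0, 0.0
    for x in xs:
        log_h += log_normal_pdf(x, mu0) - log_normal_pdf(x, mu1)
        cusum += x - delta
        gap = max(gap, abs(log_h - (math.log(h0) + (mu0 - mu1) * cusum)))
    return gap

# The gap should be at the level of floating-point rounding error.
print(max_gap([0.2, 1.4, 0.9, -0.3, 1.1], mu0=1.0, mu1=0.0))
```

Because the two monitors differ only by an affine map, thresholds on one translate directly into thresholds on the other.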


2.2.3  Model  Change  Point  Identification 

We distinguished two interrelated cases of change point estimation for network data. The non-sequential case was the off-line problem, described as follows. The network data X(t) for a time interval [0,T] was available and assessed as a multivariate model parameterized by a set S of parameter matrices and vectors, for $t \in [0,T]$. The problem was then to test the null hypothesis of no change in S against the alternative hypothesis that there existed $\tau \in (0,T)$ such that there was a change in the underlying process generating the data in $(\tau,T)$. The sequential case was the on-line, real-time problem in which the network data arrived sequentially. The change point should be recognized as soon as possible after it occurred, while the false detection rate needed to be kept as low as possible. These two versions of the change point analysis problem were closely related, in that the off-line analysis could provide insight into the “distribution” of change points in the network data.

We investigated various ways of improving the model of Section 2.1.1 for real-time computation and assessment. In particular, we looked at a possible use of principal component expansions of multivariate time series. Such expansions write the original time series as a sum of a small number of components, $X(t) = A_1(t) + A_2(t) + \ldots + A_n(t) + \varepsilon(t)$. The $A_i$ were assumed to be independent in some sense, and also interpretable; $\varepsilon(t)$ was a random noise component. Interpretability here related to factors such as trend, cyclicity, and other harmonic interpretations. For the off-line problem, we looked at wavelet-based techniques as well, and compared them with the spectral outputs of Arfit.

Given the data we had collected so far, it might actually be more reasonable to implement a tamer problem, namely, the detection of abrupt changes in the mean only. For this problem, sophisticated tools like wavelets and principal component expansions may offer little improvement over traditional techniques like the Kolmogorov-Smirnov tests for change in distribution and for change in spectrum.

The  model  abstraction  problem  was  explored  from  a  real-time  network  environment  perspective, 
usually  admitting  different  system  models  (such  as  queue-theoretic)  at  its  different  equilibrium 
states.  To  computationally  depict  system  states  at  any  level  of  abstraction,  it  was  necessary  to 
identify  correct  models  consistent  with  observables.  However,  any  such  system  identification 
need  not  be  permanent,  particularly  for  a  dynamic  system.  In  such  situations,  as  the  system 
appears  to  migrate  from  one  equilibrium  state  to  another,  one  should  be  able  to  quickly  identify 
an  event  of  context  switching  from  one  model  abstraction  to  another.  In  this  section,  we  show 
how,  using  a  variation  of  traditional  CUSUM  statistical  approaches,  one  could  identify  model 
change  events  on  time. 

Our objective was to explore the problem of model abstraction for a dynamical system such as a computer network, to be realized in real time so that the system could be monitored and controlled effectively. If a system were capable of being in only one equilibrium state, we would have the simpler problem of deriving a consistent system model given its observed performance values, such as those appearing as SNMP variables. However, a network is capable of being in a number of plausible equilibrium states. Even if we concerned ourselves primarily with queuing models at a network node or a link (client-server type queuing models requiring buffers), we would have a number of different models from which to choose. The distribution types of the traffic rates would determine the functional model. An M/M/1 queuing model at a server node could suddenly change to a G/M/1/k type model, and unless we captured this model-change event in time, we would almost surely have a wrong profile of the attendant node. Ideally, then, it was essential to find out when, and to what kind of model, a given system migrated as it moved from one equilibrium state to another. This was attempted in this section.

Our observables were all SNMP-brokered time series; using these, we should be able to realize a specific parametric system model. For instance, at an interface where a node was connected to the communication net, we measured an observable like “ifOutOctets”, the number of octets sent out at the interface by the server since the last observation. These observables were our only clue for inferring profiles of individual nodes and links, and for inferring the entire network as a function of these SNMP-supplied RMON variables. The next logical step was to devise a way to ascertain a model change point.

For our purpose, we considered a real-time dynamical system R whose observed state s changed in a non-deterministic fashion over time. Our objective was to depict R by a predictive model M(R) on a real-time basis through its observable s. If at a time t the underlying system model was M(R), we should be able to predict, in view of this model, the system’s state s(t') at a nearby time $t' = t + \delta t$, given that we knew its system state s at the earlier time t.

However, a dynamical system may change its behavior during its existence. A change in system behavior may not always be explained in terms of the current system model, and it might be necessary to consider a different model that helps explain the current system behavior better. In view of this, we needed a procedure to pinpoint the time instance when a system changed its model from M(R) to M'(R). When the underlying model affecting the system’s behavior changed, the result was a system state change. However, the reverse was not necessarily true: not all state changes indicated a model change. Within the scope of M(R), a change $s \to s'$ might be a valid observation that did not point to any model change at the system level. Our objective in this section was to obtain a systematic approach by which we could distinguish two types of events from each other, namely (we used the symbols $\to$ and $\Rightarrow$ to denote “changes to” and “logically implies”, respectively),

$$(s(t) \to s(t')) \Rightarrow (M(R)_t \wedge M'(R)_{t'}) \qquad (18)$$

$$(s(t) \to s(t')) \Rightarrow (M(R)_t \wedge M(R)_{t'}) \qquad (19)$$

Events depicted in (18) were model change events, and those portrayed in (19) were confirmations of model invariance, between two sampling events at times t and t'. We required that $M(R)_t$ be the most relevant and consistent model for the real-time system R at a time t, in the sense that the system state s(t') predicted with this model for a future time t' was statistically close and similar to the observed system state s(t'). For any pair of time instances, either (18) or (19) would apply. It was possible that between the sampling events at times t and t' the system model changed more than once and yet, at the later time t', the computed system model appeared to be the same as the last one. This was perfectly acceptable.

To detect an instance of a model change time point, we partitioned the monitored time length (the observation period) into a set of consecutive epochs $e \in E$. Every epoch began when a model change occurred; therefore, the length of an epoch, most likely variable, was the length of time during which the system model, for all computational and controlling purposes, remained unchanged. Accordingly, we defined

$$((t \in e) \wedge (t' \in e') \wedge (M(R)_t \wedge M(R)_{t'})) \Rightarrow (e = e') \qquad (20)$$

$$((t \in e) \wedge (t' \in e') \wedge (M(R)_t \wedge M'(R)_{t'})) \Rightarrow (e \ne e') \qquad (21)$$

As indicated earlier, the duration of an epoch could vary, since it was determined by a model change event. Since the system model remained invariant over an epoch e, we required that the relevant sampling statistics on system states also remain invariant. If an epoch was large enough, we could partition it further into slots of fixed size $\Delta$. Each slot would comprise a bundle of samples of observations capable of yielding relevant system statistics for that slot. Our assumption that $\Delta$ was large (large e) was necessary to ensure that we had enough samples to garner appropriate inferences on epoch means, variances, etc. Within an epoch, all distributions were assumed stationary (otherwise, a model change event had taken place), and therefore we assumed that the sampling distribution of slot means, slot variances, etc. pointed to epoch means, epoch variances, etc. Obviously such assumptions were ideal; an epoch e need not be large enough to accommodate slots large enough to support appropriate statistical inferences. But for our model we assumed this to be true. The outline in the next paragraph suggests a way to infer a model change event at a meta-model level at a time point, using CUSUM statistics of the kind often used in production planning and quality control.

If the underlying system model did not change, this would be reflected in the time profile of a CUSUM statistic similar to one we would find in a quality or production control environment. For our purpose, we proposed the CUSUM of order k defined below. In order to approach our problem comprehensively, we first showed how one could identify a model change point in a single-variable system. We next extended this to model change point estimation on a bundle of time series.

Consider a single strand of time series x depicting a single-variable system. On this, let

$x_i^{s,e}(t)$ = ith sample of the observable x in slot s in epoch e
$\mu^{s,e}$ = slot sample mean of observable x in slot s in epoch e

Given these, we could define higher-order moments of system observables as follows. We defined, for a slot s, the kth moment

$$m_k^{s,e} = \frac{1}{N} \sum_i \left( x_i^{s,e} - \mu^{s,e} \right)^k$$

where N was some normalization constant, in terms of which the average of the kth moment over all slots in an epoch e was the epoch mean of the kth moment, which we assumed to be computationally available at the beginning of our excursion of the epoch e:

$$\mu^{k,e} = \frac{1}{n_s N} \int \sum_i \left( x_i^{s,e} - \mu^{s,e} \right)^k ds \quad \text{for } k = 1, 2, \ldots, K. \qquad (22)$$

We now defined two CUSUM statistics as a function of time:

$$cs_+^{k,e}(t) = \sum_i \max(x_i^{s,e} - \mu^{e} - \tau_1, 0)\, \max(m_k^{s,e} - \mu^{k,e} - \tau_k, 0) \qquad (23a)$$

$$cs_-^{k,e}(t) = \sum_i \min(x_i^{s,e} - \mu^{e} + \tau_1, 0)\, \min(m_k^{s,e} - \mu^{k,e} + \tau_k, 0) \qquad (23b)$$

Notice that the epoch mean $\mu^{e}$ was the same as the sampling slot mean $\mu^{s,e}$ under the assumption we proposed earlier. Likewise, the sampling means of the variance and of other higher-order statistics converge to the corresponding epoch statistics. The constants $\tau_k$ were appropriate threshold parameters acting as filters for ease of computation.


For our problem, we restricted the CUSUMs (23a) and (23b) to a second-order product and obtained

$$cs_+^{2,e}(t) = \sum_i \delta\!\left( x_i^{s,e} > \mu^{e} \right) \delta\!\left( (x_i^{s,e} - \mu^{e})^2 > E(x_i^{s,e} - \mu^{e})^2 \right) \qquad (24a)$$

$$cs_-^{2,e}(t) = \sum_i \delta\!\left( x_i^{s,e} < \mu^{e} \right) \delta\!\left( (x_i^{s,e} - \mu^{e})^2 < E(x_i^{s,e} - \mu^{e})^2 \right) \qquad (24b)$$

where $\delta(\vartheta > q) = \max((\vartheta - q - \tau), 0)$ and $\delta(\vartheta < q) = \min((\vartheta - q + \nu), 0)$.

The constants $\tau$, $\nu$ were appropriate threshold parameters. Note that a measure of the kth moment of a distribution across the epoch, $E(x_i^{s,e} - \mu^{e})^k$, was computed as an average over all exploratory slots before actual monitoring occurred in the rest of the epoch. In other words, at the beginning of every epoch, the first few slots would be used to determine the target statistics for the rest of the epoch. It was against these measures that the attendant variable would be monitored.

For convenience, we divided the CUSUM statistics by an appropriate normalization constant. A sharp change in the time derivative of either statistic portended the end of an epoch and the beginning of a new one, in which a new model M'(R) appeared to be the most likely one. Therefore, at a sample point s(t) we proceeded as follows:

□ Obtain $d(cs^{k,e})/dt$ at the sampling time t.

□ $(2/\pi)\,\mathrm{abs}(\arctan(d(cs^{k,e})/dt)) > \tau_{threshold}$ implied a model change at the time point t.

□ The strength of the hypothesis that there was a model change at time t where (19) was observed was

$$\zeta(t) = (2/\pi)\,\mathrm{abs}(\arctan(d(cs^{k,e})/dt)) > \tau_{threshold} \quad \text{(between 0.0 and 1.0)} \qquad (25)$$

The  criterion  offered  two  things.  On  a  single-strand  mode  (system  with  a  single  variable  time- 
series),  it  offered  the  best  estimate  of  a  time-instant  when  the  system  underwent  a  model  change, 
and  it  offered  a  measure  of  belief  (on  a  scale  of  0.0  to  1.0)  that  there  was  a  change  point  at  the 
time  indicated. 
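The slope criterion can be sketched in Python as follows. The backward finite difference used to approximate the time derivative, the unit sampling interval, the function names, and the sample CUSUM values are all assumptions made for illustration.

```python
import math

def change_strength(cs_prev, cs_curr, dt=1.0):
    """(2/pi) * |arctan(d cs / dt)| using a backward finite difference; lies in [0, 1)."""
    slope = (cs_curr - cs_prev) / dt
    return (2.0 / math.pi) * abs(math.atan(slope))

def detect_changes(cs, threshold, dt=1.0):
    """Return (sample index, strength) for every sample whose strength exceeds threshold."""
    hits = []
    for i in range(1, len(cs)):
        strength = change_strength(cs[i - 1], cs[i], dt)
        if strength > threshold:
            hits.append((i, strength))
    return hits

# Hypothetical normalized CUSUM values with one sharp jump between samples 2 and 3.
cs_values = [0.0, 0.1, 0.1, 1.1, 1.2]
print(detect_changes(cs_values, threshold=0.4))
```

The arctangent maps any slope onto a bounded scale, so the same threshold can be applied to series with very different magnitudes once the statistics are normalized.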

Our CUSUM definition in (23) offered us two simultaneous opportunities to detect a model change event. The first term in the product referred to the immediate change observed in the time series x relative to the target value, the epoch mean $\mu^{e}$. This measure was a short-term, localized one, subject to the vagaries of noise. The second term in the product was a measure over a long-range time span (and, hence, less vulnerable to local noise), as the target was compared with the neighborhood variance. Together, our definition rendered a better measure for a model change point: it could detect a model change event if the epoch mean shifted (with the variance constant), if the variance shifted (with the epoch mean constant), or both.

The single-strand decision criterion could be extended to a decision base of multiple time series. The system state s was a vector with p components, where $p \ge 2$. The component states $s_i(t)$ and $s_j(t-1)$ could be correlated. Since the complexity of the decision framework in such a case would be horrendous for any real-time system projection, for the time being we ignored this possibility and assumed that all p dimensions of the state space were mutually independent.

Using the above CUSUM statistics, we could obtain for each component of s the change event points around time t, if any occurred. Let L be the list of all such events occurring between time t and $t + \delta t_w$. The time instance t referred to the earliest event time; $\delta t_w$ was the window of opportunity to detect other event points around t. For each model change event registered in the list L we associated a pair of numbers: the time of detection $t^i_{detect}$, and the computed strength of belief $\zeta_i(t_{detect})$ that a model change event was detected on the time series for the state component $s_i$. Within this window $\delta t_w$, there was a cluster of event points. We proposed that the center of such a cluster (along with its consolidated strength) was the most likely occurrence of a model change.


The center of an event cluster around t could be computed as follows. Suppose the current cluster coordinate was at $(t_{last}, \zeta_{last}, n_{last})$, where $n_{last}$ was the number of state components clustered at the time point $t_{last}$ with a consolidated strength $\zeta_{last}$. This, along with a single-point model change event at some coordinate point $(t_p, \zeta_p)$, needed to form a new cluster center. The time point of the new cluster was computed recursively as

$$t_{new} = \frac{n_{last}\, t_{last} + t_p}{n_{last} + 1} \qquad (26a)$$

Similarly, the consolidated possibility index of the new cluster was computed as

$$\zeta_{new} = \frac{n_{last}\, \zeta_{last} + \zeta_p}{n_{last} + 1} \qquad (26b)$$

Notice that $(t < t_{last} < t + \delta t_w) \wedge (t < t_p < t + \delta t_w) \Rightarrow (t < t_{new} < t + \delta t_w)$ and $0 < \zeta_{new} < 1.0$. One might argue that a single strong model change point could be overwhelmed by a large number of weak model change points, yielding a false trigger. But one could also look at it a different way: a single strong showing (with a possibility value near 1.0) could be construed as noise, particularly when a large number of state parameters were showing a model change in a different time neighborhood, even though their individual claims to a model change event were only mediocre.
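The recursive cluster update of (26a) and (26b) amounts to a running average of event times and strengths, as the following Python sketch shows. The tuple layout, the function names, and the event values are hypothetical, and the events are assumed to have already been restricted to one detection window.

```python
def merge_event(cluster, event):
    """Fold one (time, strength) event into a (time, strength, count) cluster per (26a)-(26b)."""
    t_last, z_last, n_last = cluster
    t_p, z_p = event
    t_new = (n_last * t_last + t_p) / (n_last + 1)
    z_new = (n_last * z_last + z_p) / (n_last + 1)
    return (t_new, z_new, n_last + 1)

def cluster_events(events):
    """Cluster all events in one window, starting from the earliest event."""
    ordered = sorted(events)
    t0, z0 = ordered[0]
    cluster = (t0, z0, 1)
    for event in ordered[1:]:
        cluster = merge_event(cluster, event)
    return cluster

# Hypothetical detections from three state components inside one window.
window_events = [(41.0, 0.9), (40.0, 0.6), (42.0, 0.3)]
print(cluster_events(window_events))
```

Because each update is a convex combination, the cluster time stays inside the window and the consolidated strength stays between the smallest and largest individual strengths.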


Using the Matlab computing environment, we first implemented one-sided CUSUM filters for detecting model change points in network traffic data. Exhibit 2 shows the first dimension of the series of Exhibit 1, i.e., the number of octets arriving at a router interface, in our traffic data collection environment.




Our  CUSUM  filters  produced  the  pulses  and  pulse  clusters  shown  in  Exhibit  3  for  the  upper  and 
lower  first-order  CUSUMs  of  the  data  in  Exhibit  2,  with  the  lower  CUSUM  graphed  as  its 
negative,  to  contrast  the  two  one-sided  components. 
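
For readers unfamiliar with one-sided CUSUMs, the sketch below shows the textbook form of an upper and a lower filter on the mean, written in Python rather than Matlab. It is an illustration under stated assumptions, not the filter used to produce Exhibit 3: the parameters target_mean, drift, and threshold, and the reset-after-pulse rule, are assumptions introduced here.

```python
def one_sided_cusums(data, target_mean, drift, threshold):
    """Return (upper_pulses, lower_pulses) as lists of (index, amplitude)."""
    upper = lower = 0.0
    upper_pulses, lower_pulses = [], []
    for k, x in enumerate(data):
        # Accumulate deviations above (upper) or below (lower) the target mean.
        upper = max(0.0, upper + (x - target_mean - drift))
        lower = max(0.0, lower + (target_mean - x - drift))
        if upper > threshold:
            upper_pulses.append((k, upper))
            upper = 0.0  # reset after signalling (assumed rule)
        if lower > threshold:
            lower_pulses.append((k, -lower))  # negated, as the lower CUSUM is graphed
            lower = 0.0
    return upper_pulses, lower_pulses


# Hypothetical series with an upward shift in the mean at index 5.
series = [10.0] * 5 + [14.0] * 5
up, down = one_sided_cusums(series, target_mean=10.0, drift=0.5, threshold=5.0)
```

With these values the upper filter emits its first pulse two samples after the shift, and the lower filter stays silent.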


Exhibit 3. One-sided First-order CUSUM Pulses for Exhibit 2

Cluster centers were computed to be t ≈ 115 and t = 165.

Filters  for  second-order  CUSUMs,  without  embedding  of  the  first-order  CUSUMs,  produced  the 
pulse  sets  of  Exhibit  4  for  the  same  data  of  Exhibit  2. 


Exhibit  4.  One-sided  Second-order  CUSUM  Pulses  for  Exhibit  2 


As Exhibit 4 shows, the results for higher-order CUSUMs seem to suffer from more noise
than in the first-order case. Future work should address more stable implementations of
higher-order CUSUMs.

A  trivial  but  useful  byproduct  of  the  first-order  and  second-order  evaluations  above  was  a 
heuristic  for  model  change  point  detection: 

□  If both first- and second-order filters signal a change point at τ (computed with the
clustering scheme above), then declare a model change point at τ.

□  If only the first- or only the second-order filter signals a change point, then declare a
model change point if the pulse amplitude(s) exceed(s) some given threshold θ.
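
The two rules above can be sketched as a small decision function. This is an illustrative Python rendering only; the convention of passing None when a filter did not signal, and the function name, are assumptions made here.

```python
def declare_change_point(first_order, second_order, threshold):
    """Apply the two-rule heuristic at one clustered time point.

    first_order / second_order: pulse amplitude of the respective filter at
    that time, or None if the filter did not signal (assumed convention).
    """
    if first_order is not None and second_order is not None:
        return True  # rule 1: both filters agree
    amplitude = first_order if first_order is not None else second_order
    if amplitude is None:
        return False  # neither filter signalled
    return abs(amplitude) > threshold  # rule 2: single filter, strong pulse


both = declare_change_point(0.3, 0.2, threshold=0.8)
weak_single = declare_change_point(0.3, None, threshold=0.8)
strong_single = declare_change_point(None, -0.9, threshold=0.8)
```

The absolute value is taken so that pulses from a lower CUSUM, which are graphed as negative, are compared on the same scale.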

Superimposing the first- and second-order CUSUM results of Exhibits 3 and 4 produces the
following graph (Exhibit 5).

Exhibit 5. Combined Clusters from First and Second-order CUSUM Filters


Cluster centers were reconfirmed to hold at t = 115 and t = 165.


2.3  TASK  3  -  OBJECT  MODELING  AND  SOFTWARE  DESIGN 
2.3.1  Object  Modeling  our  Software  Prototype 

The object model of our software prototype proceeded on two fronts: change point detection
and vector autoregression. In regard to change points, after careful examination of the real-time
computational costs (under the Java language) of the mathematical models of Tasks 1 and 2, a
design and implementation decision was made to consider only CUSUM filter classes in lieu of
more general Bayes factors (Exhibit 6).

Exhibit 6. UML Class Diagram for CUSUM Filter


These  CUSUM  classes  were  given  slices  of  time  series  data,  and  then  computed  model  change 
points  based  on  mean  and  variance  decision  schemes.  It  was  up  to  the  appropriate  network 
management  software  component  to  correlate  the  outputs  of  the  CUSUM  filters  to  the  results 
from  other  components,  to  decide  the  form  of  feedback  for  the  human  user,  and  to  call  for  any 
necessary further processing. The Unified Modeling Language (UML) sequence diagram shown
in Exhibit 7 summarizes the interaction between CUSUM and its clients (for example, a
network manager).

Exhibit 7. UML Interaction Diagram between CUSUM and Its Clients

[Sequence diagram: AnyClient calls filter(TimeSeriesData) on a CusumFilter object, which returns a CusumChangeData object.]


As  depicted  in  Exhibit  6,  the  CUSUM  object  model  consisted  of  three  main  classes,  viz. 
TimeSeriesData,  CusumChangeData,  and  CusumFilter.  The  first  two  classes,  respectively, 
encapsulated  the  input  and  output  to  the  third  class.  Both  were  serializable  in  the  Java  sense,  so 
they  could,  for  example,  be  used  as  arguments  or  return  types  of  Java  RMI  (Remote  Method 
Invocation)  methods,  if  the  CUSUM  filters  were  distributed  over  multiple  hosts  and  devices. 

Note  that  not  all  instance  variables  of  the  CusumFilter  class  are  shown  in  Exhibit  6.  For 
example,  our  algorithms  needed  to  keep  track  of  the  last  computed  two-sided  CUSUM  values, 
for  both  mean  and  variance,  and  also  the  run  lengths  for  non-zero  CUSUM  values  since  the  last 
reset.  We  kept  these  data  in  variables  such  as  LastMeanUpperCusum  and  nzMeanUpperCusum. 
Data  and  associated  collection  times  were  also  accumulated  in  the  CusumFilter  object  as  it  was 
not  known  in  advance  when  a  change  point  would  occur.  We  stored  the  accumulated  data  in 
variables TimeLabelRun and DataValueRun. An integer variable called ResidualLength kept
track of the data not yet processed when the filter() method of the CusumFilter object returned. It
would have been possible to arrange the computations differently, by making the
CusumFilter object always exhaust the data passed to it and return a collection of
CusumChangeData objects. We decided against this second scheme, as we wanted the detection
and publishing of change points to occur as soon as possible, for the sake of real-time operation.
Details about the algorithms used by the CusumFilter class are given in Section 2.4.3.
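
The design choice described above, returning as soon as a change point is found and recording how much data remains unprocessed, can be sketched as follows. This is a simplified Python illustration, not the Java prototype: only an upper mean CUSUM is shown, the constructor parameters and the _pending buffer are hypothetical, and the attribute names merely echo the variables named in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TimeSeriesData:
    times: List[float]
    values: List[float]


@dataclass
class CusumChangeData:
    time: float
    amplitude: float


class CusumFilter:
    def __init__(self, target_mean, drift, threshold):
        self.target_mean, self.drift, self.threshold = target_mean, drift, threshold
        self.last_mean_upper_cusum = 0.0  # state carried across calls
        self.residual_length = 0          # samples left unprocessed on return
        self._pending = TimeSeriesData([], [])  # hypothetical internal buffer

    def filter(self, data: Optional[TimeSeriesData] = None) -> Optional[CusumChangeData]:
        """Process buffered data; return at the first change point, else None."""
        if data is not None:
            self._pending.times.extend(data.times)
            self._pending.values.extend(data.values)
        while self._pending.values:
            t = self._pending.times.pop(0)
            x = self._pending.values.pop(0)
            step = x - self.target_mean - self.drift
            self.last_mean_upper_cusum = max(0.0, self.last_mean_upper_cusum + step)
            if self.last_mean_upper_cusum > self.threshold:
                change = CusumChangeData(t, self.last_mean_upper_cusum)
                self.last_mean_upper_cusum = 0.0
                self.residual_length = len(self._pending.values)
                return change  # publish immediately; the rest waits for the next call
        self.residual_length = 0
        return None


f = CusumFilter(target_mean=10.0, drift=0.5, threshold=5.0)
first = f.filter(TimeSeriesData(list(range(10)), [10.0] * 5 + [14.0] * 5))
residual_after_first = f.residual_length
second = f.filter()  # continue with the residual data only
```

A client that wants every change point simply keeps calling filter() while residual_length is non-zero, which preserves the early publication of the first detection.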

Regarding  vector  autoregression,  moving  from  our  original  Matlab  code  to  Java  required  not 
only  objectifying  the  autoregressive  (AR)  algorithms,  but  also  selecting  appropriate  Java 
numerical packages.

Two packages, Jampack and JAMA, from NIST (National Institute of Standards and
Technology) were found extremely useful. We used both in our implementation since in many
ways they were complementary. Jampack provided most of the required functionality but lacked
the precision we wanted for Cholesky decomposition algorithms.

Exhibit 8. UML Class Diagram for AR Modeling

DimensionedUpperTriangularQR
    upperTriangularQR : Zmat = null
    stateSpaceDimension : int = 0
    countBlockEquations : int = 0
    minimumOrder : int = 0
    maximumOrder : int = 0
    DimensionedUpperTriangularQR()
    getUpperTriangularQR()
    getStateSpaceDimension()
    getCountBlockEquations()
    getMinimumOrder()
    getMaximumOrder()

RegularizedUpperTriangularQR
    upperTriangularQR : Zmat = null
    regularizationVector : Zmat = null
    RegularizedUpperTriangularQR()
    getUpperTriangularQR()
    getRegularizationVector()

OrderBoundedTimeSeries
    dataSeries : Zmat = null
    minimumOrder : int = 0
    maximumOrder : int = 0
    OrderBoundedTimeSeries()
    getDataSeries()
    getMinimumOrder()
    getMaximumOrder()

SchwarzCriterionVector
    schwarzValue : Zmat = null
    logDetCovarianceMatrices : Zmat = null
    countsOfParameterVectorsAtGivenOrders : Zmat = null
    SchwarzCriterionVector()
    getSchwarzValue()
    getLogDetCovarianceMatrices()
    getCountsOfParameterVectorsAtGivenOrders()

MultivariateAutoregressiveModel
    model()
    computeARmodel()
    computeQRofAR()
    computeOptimalOrder()
    displayModelOnConsole()

ModelParameters
    optimalOrder : int = 0
    interceptVector : Zmat = null
    coefficientMatrices : Vector = null
    noiseCovarianceMatrix : Zmat = null
    schwarzValue : SchwarzCriterionVector = null
    forConfidenceAndSpectral : Zmat = null
    ModelParameters()
    getOptimalOrder()
    getInterceptVector() : Zmat
    getCoefficientMatrices() : Vector
    getNoiseCovarianceMatrix()
    getSchwarzValue()
    getForConfidenceAndSpectral()


As depicted in Exhibit 8, the object model for vector autoregression consisted of six main classes:
OrderBoundedTimeSeries, RegularizedUpperTriangularQR, DimensionedUpperTriangularQR,
SchwarzCriterionVector, ModelParameters, and MultivariateAutoregressiveModel. The last
class contained the bulk of our algorithms. The first five classes encapsulated intermediate data
that were passed between the methods of the MultivariateAutoregressiveModel class, which were
made static by design.

Details  of  the  AR  model  algorithms  are  given  in  Section  2.4.3.  However,  the  following  items 
were  noted  regarding  the  above  object  model.  First,  the  following  sequence  diagram  could 
summarize  the  interaction  between  the  AR  modeler  and  its  clients. 

Exhibit 9. UML Interaction Diagram between AR and Its Clients

[Sequence diagram: AnyClient calls model(double[], int, int) on the vector AR modeler (MultivariateAutoregressiveModel), which returns a ModelParameters object.]


The arguments passed by a client were the (possibly multivariate) time series, specified as a
two-dimensional array, and the minimum and maximum orders desired, passed as integers. Passing a
lower and an upper bound made the algorithm flexible enough to apply to different use case
scenarios. For example, if the client had no additional information about the true order of the
AR model sought, it would use zero as the minimum order and a sufficiently high positive number
(e.g., 50) as the maximum order. On the other hand, if the client were looking for a linear trend, for
example, it would set both minimum and maximum orders to one.
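
The role of the two order bounds can be illustrated with a much reduced stand-in for the modeler. The sketch below is in Python, is univariate, fits each candidate order by ordinary least squares on the normal equations, and picks the order with the smallest Schwarz (Bayesian) criterion; the prototype itself was multivariate and QR-based, so every function name and the synthetic series here are assumptions for illustration only.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ar(series, order, start):
    """Least-squares AR(order) fit with intercept, using samples from index start on."""
    rows = [[1.0] + [series[t - i] for i in range(1, order + 1)]
            for t in range(start, len(series))]
    targets = series[start:]
    k = order + 1
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    coeffs = solve(ata, atb)
    resid = [y - sum(c * v for c, v in zip(coeffs, r)) for r, y in zip(rows, targets)]
    return coeffs, sum(e * e for e in resid) / len(resid)


def select_order(series, min_order, max_order):
    """Return (order, coefficients) minimizing the Schwarz criterion in the given range."""
    n = len(series) - max_order  # same effective sample for every candidate order
    best = None
    for p in range(min_order, max_order + 1):
        coeffs, variance = fit_ar(series, p, max_order)
        sbc = math.log(variance) + (p + 1) * math.log(n) / n
        if best is None or sbc < best[0]:
            best = (sbc, p, coeffs)
    return best[1], best[2]


# Synthetic AR(1) series, used only to exercise the two calling styles.
random.seed(0)
x = [0.0]
for _ in range(2000):
    x.append(1.0 + 0.5 * x[-1] + random.gauss(0.0, 1.0))
order, coeffs = select_order(x, 1, 1)    # order forced to one
free_order, _ = select_order(x, 0, 5)    # order left to the criterion
```

Setting both bounds to the same value bypasses the search entirely, which is the "linear trend" use case mentioned above.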

Next, the Zmat class played a prominent role in both the object model and the implementation
algorithms. Zmat is the core class of the Jampack numerical package. It represents a matrix with
complex entries, with methods for manipulating submatrices. The design of Jampack calls for
familiar matrix operations to be implemented in suites, which are simply Java classes. For
example, the LU decomposition was done with the Zludpp suite, more precisely in the
constructor of the Zludpp class, with the Zmat object to be decomposed passed as the argument.

Next,  the  methods  of  the  MultivariateAutoregressiveModel  class  were  all  static,  in  the  Java 
sense.  This  was  in  accordance  with  the  general  philosophy  behind  the  Jampack  package  itself. 
This technique saved us from passing heavy objects around our algorithms.

Finally, the displayModelOnConsole() method of MultivariateAutoregressiveModel was just a
trivial example for showing the calculated parameters of a vector AR model. It was actually up
to the clients of the class to decide how to display these parameters.


2.3.2  Graphical  User  Interface 


The  graphical  user  interface  (GUI)  underwent  many  changes.  The  GUI  provides  not  only  the 
point  and  click  interface  that  all  users  are  accustomed  to,  but  also  a  graphical  depiction  of  some 
of  the  more  complicated  aspects  of  network  management  and  network  modeling.  All  of  this 
information  is  obviously  related  to  traffic  on  the  network.  Since  the  raw  traffic  information  is 
required  at  almost  every  level  of  the  RTNM,  it  made  sense  to  design  an  architecture  with  a 
central  data  collector. 

As its name suggests, the CollectorManager was responsible for the collection of traffic data. With
these  data  we  could  determine  things  such  as  the  number  of  packets  in,  packets  out,  packets 
dropped,  packets  per  second,  and  so  on.  The  raw  data  were  collected  by  the  CollectorManager, 
which  passed  this  information  to  each  of  the  objects  responsible  for  calculating  the  information 
we  needed  (see  Exhibit  10). 

Exhibit 10. RTNM Data Collection Overview

These objects were responsible for all the information generated by the RTNM, including
network topology, traffic, modeling and prediction, and access to common network management
tools. The network view provided a “top level” view of the network or a region of the network
(Exhibit 11). This allowed the network administrator to see the status of the network at a glance.
Changing colors indicated node and link utilization. A pie chart representing the current region
indicated the relative levels of utilization for this view (low, light, medium, high, down, and
unmanageable).


Exhibit 11. The Network View

By clicking on a specific node, one could drill down and retrieve more specific information
about  that  node.  This  drill  down  view  presented  the  user  with  what  we  call  the  node  monitor 
view  (Exhibit  12).  The  node  monitor  view  provided  a  traffic  level  view  of  a  particular  node  on 
the  network.  This  would  usually  be  a  switch  or  router,  although  it  could  be  a  workstation  acting 
as  a  router  or  server.  At  this  level,  one  could  inspect  the  actual  number  of  packets  going  in  and 
out  of  this  node.  The  traffic  is  shown  for  both  the  interface  and  Internet  protocol  (IP)  layer. 

This view also shows a graph of the CUSUM model used in the determination of network traffic
change points.

Exhibit 12. The Node Monitor


Although  the  RTNM  program  provides  this  specialized  network  information,  no  network 
management  package  would  be  complete  without  some  basic  network  troubleshooting  tools. 
The  RTNM  provides  a  database  browser,  which  provides  a  simple  way  of  monitoring  and 
changing  information  being  stored  about  the  network.  This  means  the  users  need  not  concern 
themselves  with  database  management  or  SQL  statements  (see  Exhibit  13). 


Exhibit  13.  Database  Browser 


The RTNM package also provides access to the Ping and Traceroute tools, which are considered
standard  troubleshooting  mechanisms.  Our  tools  have  a  friendlier  graphical  interface  added  to 
them.  This  provides  ease  of  use  and  eliminates  the  need  to  log  into  a  separate  shell  in  order  to 
run  these  commands,  as  shown  in  Exhibit  14. 


Exhibit  14.  Screenshot  of  Ping  and  Traceroute  Tools 


2.3.3  SNMP  Network  Management  with  JDMK 

JDMK (Java Dynamic Management Kit) was used as the foundation for SNMP support.
JMAPI (Java Management Application Programmer’s Interface) was the API originally
planned for this effort. However, Sun dropped JMAPI and started working on JMX
(Java Management Extensions). An evaluation of other Java APIs for network
management was completed (see Table 8). Because JDMK supported Sun’s new JMX effort, we
used JDMK.




Table 8. Evaluation of Management Java APIs

The JDMK supports SNMP v1 and SNMP v2 (including bulk requests). Using the SNMP
Manager API provided with JDMK, we were able to:

□  Send  SNMP  requests  (and  receive  responses) 

□  Receive  SNMP  traps 

□  Query  SNMP  nodes  in  synchronous  and  asynchronous  modes 

□  Translate OIDs (object identifiers), using the dictionary produced by
mibgen

This  API  and  its  functionality  provided  the  Real  Time  Network  Manager  with  fuller  network 
management  support,  more  specifically  more  robust  SNMP  support. 


Our prototype created for Phase I of the SBIR included the ability to send SNMP requests and
receive their responses. This proof of concept supported only SNMP v1. The current version of
JDMK, 4.2, supports SNMP v1 and v2c and allows the SBIR Phase II application, RTNM, to
support a wider range of nodes with the addition of SNMP v2c.


JDMK  provides  support  for  SNMP  Traps.  By  receiving  SNMP  Traps  RTNM  can  become  a  true 
network  management  system.  Currently  RTNM  can  collect,  store,  and  display  traps  sent  from 
correctly  configured  nodes  (Exhibit  15).  This  information  may  be  used  as  an  input  to  our  expert 
system  to  inform  the  network  administrator  of  problems  that  may  be  occurring. 

Exhibit 15. The SNMP Trap Browser

With the addition of asynchronous communications, RTNM has the ability to query multiple
nodes at once. Our Phase I prototype implementation used synchronous querying of nodes, as
JMAPI support was limited. That process allowed RTNM to query only a single node at a time,
prohibiting a true picture of the entire network at once. With JDMK, we could
query multiple systems at once, so that queries ran concurrently and
our displays and data were a true snapshot of the network.

We used information from MIB-II (defined in RFC 1213), which contains many OIDs. Working
with raw OIDs can be very confusing. For example, the variable ifInOctets has OID
1.3.6.1.2.1.2.2.1.10. JDMK translated the OIDs into symbolic names, so one could ask for
ifInOctets on interface 1 simply by using ifInOctets.1. This made code easier to read,
write, and debug.
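
The kind of translation described above can be pictured with a small lookup table. The sketch below is plain Python and does not use or imitate the JDMK API or the dictionary generated by mibgen; the table MIB2_NAMES and both function names are hypothetical, and only the ifInOctets OID is taken from the text.

```python
# Hypothetical name table; only the ifInOctets entry comes from the report.
MIB2_NAMES = {
    "ifInOctets": "1.3.6.1.2.1.2.2.1.10",
}


def to_oid(symbolic):
    """Translate 'name' or 'name.instance' into a dotted numeric OID."""
    name, _, instance = symbolic.partition(".")
    base = MIB2_NAMES[name]
    return f"{base}.{instance}" if instance else base


def to_symbolic(oid):
    """Translate a numeric OID back to its symbolic form when the prefix is known."""
    for name, base in MIB2_NAMES.items():
        if oid == base:
            return name
        if oid.startswith(base + "."):
            return f"{name}.{oid[len(base) + 1:]}"
    return oid  # unknown OIDs are returned unchanged


numeric = to_oid("ifInOctets.1")
symbolic = to_symbolic("1.3.6.1.2.1.2.2.1.10.1")
```

The trailing ".1" is the instance (here, the interface index), which is appended to the column OID unchanged.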

These added features provided fuller support of SNMP in RTNM. RTNM has the ability to
collect all physical and network layer OIDs at defined time intervals. This was possible because
JDMK supplied RTNM with a robust API for gathering SNMP information about a network.


2.3.4  Model  Based  Diagnostics 


Problem diagnosis in any application domain, and in network management in particular, is in
general an intractable problem. A given observed symptom may have been produced by a
myriad of root causes. In practice, we usually proceeded by iteration, starting off with a
collection $\Upsilon$ of the most likely causes. Then, at each iteration, we removed as many unlikely
causes from $\Upsilon$ as possible, through appropriate (usually stochastic) elimination techniques. We
looked at integrating vector autoregressive modeling with two such techniques, namely
Bayesian belief inference and vector time series causality tests.

For a given pair (S, M), where S is a set of nodes (routers or hosts) and M is a set of Management
Information Base (MIB) variables pertaining to S, a Bayesian belief network, such as the one
shown in Exhibit 16, could be constructed. The nodes in the network refer to the possibly lagged
variables of M. There is an arc $X_{i,t-a} \to X_{j,t-b}$ when it is believed that $X_{i,t-a}$ significantly
and directly causes $X_{j,t-b}$, that is, when the conditional probabilities

$$P(X_{j,t-b} = v \mid X_{i,t-a} = w)$$

are deemed non-negligible. In our current thinking, data mining techniques could be used to
build the belief network. First, the possible values of each $X_{i,t-a}$ should be quantized to a small
number of levels v. Then, a counting of observed values could be performed to produce

$$P(X_{j,t-b} = v \mid X_{i,t-a} = w) = \frac{|X_{j,t-b} = v \,\cap\, X_{i,t-a} = w|}{|X_{i,t-a} = w|}$$
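
The counting estimate above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: the two series are taken to be already aligned in time, a single relative lag (a − b in the notation above) is used, and the quantization edges and sample data are hypothetical.

```python
from collections import Counter


def quantize(value, edges):
    """Map a raw value to a level index, given ascending bin edges."""
    return sum(value >= e for e in edges)


def conditional_table(cause, effect, lag):
    """Estimate P(effect[t] = v | cause[t - lag] = w) by counting co-occurrences."""
    pair_counts, cause_counts = Counter(), Counter()
    for t in range(lag, len(effect)):
        w, v = cause[t - lag], effect[t]
        pair_counts[(v, w)] += 1
        cause_counts[w] += 1
    return {(v, w): n / cause_counts[w] for (v, w), n in pair_counts.items()}


# Hypothetical two-level series; effect mostly follows cause with a lag of one.
cause = [0, 1, 0, 1, 1, 0, 1, 0]
effect = [0, 0, 1, 0, 1, 1, 0, 0]
table = conditional_table(cause, effect, lag=1)
level = quantize(5, edges=[3, 7])
```

Entries of the table that exceed a chosen cut-off would correspond to arcs of the belief network; pairs that never co-occur are simply absent from the dictionary.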


Once a belief network was built, it could be used to collect potential causes of an observed value
$X_{j,t-b} = v$ through standard Bayesian inference techniques. The main issue here was the large
number of lagged $X_{i,t-a}$ and the intractability of Bayesian network inference (except in very
special cases, such as single connectedness).

Exhibit 16. Bayesian Belief Network


Operational statistical causality or non-causality tests between two or more variables of vector
AR models were also researched. C.W.J. Granger introduced the basic framework for
non-causality tests in the linear case in 1969. This framework has since been extended in various
directions, including non-linear and non-parametric models. In the case of network
management, the goal was to devise reliable tests that permitted the elimination of as many
variables $X_i$ as possible as causes of the observed value of another variable $X_j$. At the present
time, network managers manually conduct similar “tests” ad hoc. It is desirable to assume as
little as possible about the “functional” relationships between variables, as such relationships are
extremely difficult to ascertain in the case of networking, and in fact may be said to be the
problem itself. In our work, we also imposed the additional constraint that the tests be easy
and efficient to compute, as our goal was to use them in real-time monitoring scenarios. Now,
the kth component of the vector AR equation

$$X_t = W + \sum_{i=1}^{p} A_i X_{t-i} + \varepsilon_t$$

may be written as follows, with the obvious meaning for the indices

$$X_{k,t} = W_k + \sum_{i=1}^{p} \Big( \sum_{j=1}^{n} A_{i,k,j}\, X_{j,t-i} \Big) + \varepsilon_{k,t}$$

Hence, the null hypothesis that $X_{r,t}$ does not linearly cause $X_{s,t+1}$ is

$$H_0 : A_{1,s,r} = 0,\; A_{2,s,r} = 0,\; \ldots,\; A_{p,s,r} = 0$$

Of course, the causation relationship between two variables, if any, need not be linear at all. In
the linear case for two variables only, we designed a procedure consisting of two regressions and
the computation of two statistics with F-distribution and χ²-distribution, respectively, under
hypothesis $H_0$. In the general case, we attempted to imitate the linear case, using Taylor series
expansion. The procedure and the obtained results require further study.
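
As a point of reference for the linear two-variable case, the sketch below computes the textbook Granger F statistic from a restricted regression (own lags only) and an unrestricted regression (own lags plus lags of the candidate cause). It is written in Python and is not the procedure designed in this effort: the χ² statistic is omitted, the normal-equations solver is a simple stand-in, and all names and the synthetic data are assumptions.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def residual_sum_of_squares(rows, targets):
    """Least-squares fit of targets on rows; return the residual sum of squares."""
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    beta = solve(ata, atb)
    return sum((y - sum(c * v for c, v in zip(beta, r))) ** 2
               for r, y in zip(rows, targets))


def granger_f(cause, effect, lags):
    """F statistic for H0: no lag of `cause` enters the equation for `effect`."""
    times = range(lags, len(effect))
    targets = [effect[t] for t in times]
    own = [[1.0] + [effect[t - i] for i in range(1, lags + 1)] for t in times]
    both = [row + [cause[t - i] for i in range(1, lags + 1)]
            for row, t in zip(own, times)]
    rss_restricted = residual_sum_of_squares(own, targets)
    rss_full = residual_sum_of_squares(both, targets)
    dof = len(targets) - (2 * lags + 1)
    return ((rss_restricted - rss_full) / lags) / (rss_full / dof)


# Synthetic data in which x_r drives x_s with a lag of one, but not the reverse.
random.seed(1)
x_r = [random.gauss(0.0, 1.0) for _ in range(500)]
x_s = [0.0] + [0.8 * x_r[t - 1] + 0.2 * random.gauss(0.0, 1.0) for t in range(1, 500)]
f_forward = granger_f(x_r, x_s, lags=2)
f_reverse = granger_f(x_s, x_r, lags=2)
```

A large F value leads to rejecting $H_0$, so a variable with a small F value is a candidate for elimination as a cause.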


We investigated a possible use of Fourier descriptors from the vector AR model equation

$$X_t = W + \sum_{i=1}^{p} A_i X_{t-i} + \varepsilon_t \qquad (27)$$

The vector $X_t$ of equation (27) may be viewed as the parametric equations of a hypercurve in
dimension m. Information was extracted from this hypercurve through appropriate functions
defined over its points. The first function of interest to us was the curvature K(t), which is the
magnitude of the rate of change of the tangent vector with respect to the arc length. That is, if

$$T = \frac{dX_t}{ds}, \qquad ds = \sqrt{dX_t \cdot dX_t}$$

then

$$K(t) = \left\| \frac{dT}{ds} \right\| = \left\| \frac{dT}{dt} \right\| \Big/ \left| \frac{ds}{dt} \right|$$

Fourier descriptors were obtained as follows. First, samples of K(t) were transformed through a
Fast Fourier Transform (FFT). Second, as $X_t$ was real, the Fourier coefficients had identical
magnitudes on the positive and negative frequency axes; hence descriptors were obtained by
dividing the absolute value of each positive frequency coefficient by the absolute value of the DC
(direct current) coefficient.
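
A minimal Python sketch of these two steps follows. It makes several simplifying assumptions that are not from the study: the curve is given as an open polyline of distinct sample points, curvature is approximated by finite differences of unit tangents, and a direct discrete Fourier transform is used in place of an FFT for brevity. All function names are hypothetical.

```python
import cmath
import math


def curvature(points):
    """Discrete K = |dT/ds| at the interior vertices of an open polyline."""
    tangents, lengths = [], []
    for a, b in zip(points, points[1:]):
        d = [q - p for p, q in zip(a, b)]
        length = math.sqrt(sum(c * c for c in d))  # assumes distinct points
        tangents.append([c / length for c in d])
        lengths.append(length)
    kappa = []
    for k in range(len(tangents) - 1):
        d_tangent = math.sqrt(sum((u - v) ** 2
                                  for u, v in zip(tangents[k + 1], tangents[k])))
        d_arc = 0.5 * (lengths[k] + lengths[k + 1])
        kappa.append(d_tangent / d_arc)
    return kappa


def fourier_descriptors(samples):
    """Magnitudes of the positive-frequency DFT coefficients, divided by |DC|."""
    n = len(samples)
    spectrum = [sum(x * cmath.exp(-2j * math.pi * f * k / n)
                    for k, x in enumerate(samples))
                for f in range(n // 2 + 1)]
    dc = abs(spectrum[0])
    return [abs(c) / dc for c in spectrum[1:]]


# Points on a circle of radius 2: the curvature should be 1/2 everywhere.
circle = [(2.0 * math.cos(2 * math.pi * k / 16), 2.0 * math.sin(2 * math.pi * k / 16))
          for k in range(16)]
kappa = curvature(circle)
descriptors = fourier_descriptors(kappa)
```

For the circle, the curvature samples are constant, so every descriptor is numerically zero; departures from zero therefore measure how much the curvature varies along the curve. Dividing by the DC magnitude makes the descriptors insensitive to the overall scale of K(t).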

Autoregression descriptors could also be obtained for K(t). Applying equation (27) to the
curvature, we obtained the following formulation

$$K(t) = w + \sum_{i=1}^{p} a_i K(t-i) + \sqrt{\beta}\,\varepsilon_t$$

where the residual was written in the more convenient form $\sqrt{\beta}\,\varepsilon_t$, which is possible because of
the single dimension. We could then take the sequence $(a_1, a_2, \ldots, a_p, w/\sqrt{\beta})$ as the feature vector.

In many particular cases there were other possible curve functions in addition to the curvature. For
example, if m = 2, we could consider the centroid Ω of the shape obtained by “completing” the
curve $X_t$ by symmetry around the line joining its end points. The function ρ(t) given by

ρ(t) = distance between Ω and $X_t$

then produced feature vectors via FFT and autoregression.

Likewise, still in the case m = 2, the two components of $X_t$ could be viewed as the real and
imaginary parts of a complex number z. The latter could be Fourier transformed directly. Here,
the DC component could be dropped, but the coefficients on the negative frequency axis were
needed. The sequence of non-DC Fourier coefficients scaled by the absolute value of the first
non-zero non-DC coefficient produced the feature vector. In our experiments, the
curvature function seemed to perform best.


2.3.5  Future  Diagnostics  Input 


The  implemented  vector  autoregressive  modeling  in  Sections  2.3.1,  2.4.3,  and  2.4.4  could  form  a 
basis  for  prediction  and  diagnosis  capabilities  for  our  software  prototype.  Several  interrelated 
research  areas  were  explored  regarding  the  significance  of  the  vector  AR  modeling. 

First, a vector AR model could be used as a point in an operating space. More precisely, for any
entity E to be monitored (e.g., a single device, a subregion of a network, or an entire network), we
could define an associated topological space $O_E$, where each point represented a state in the life
cycle of entity E.

A serialized AR model was an obvious candidate for implementing such a point. With the
notation of our previous report, if the current AR model of entity E is given by

$$X_t = W + \sum_{i=1}^{p} A_i X_{t-i} + \varepsilon_t \qquad (28)$$

then, by adopting a simple row/column serialization scheme (row 1, followed by row 2, followed
by row 3, etc.) for the model components W, C, and $A_i$ (where C is the covariance matrix of the noise
component $\varepsilon_t$), a numerical representation of the state of E was obtained. There is an obvious
metric on the space $O_E$, namely the distance between the two serialized representing vectors,
which is the norm of their difference.

There was an obvious issue regarding the above proposal, viz. the vector AR model need not
always be of the same order p throughout the life cycle of entity E. Indeed, if the order were
always the same, then the topological space $O_E$ would in fact be a normed vector space. A
practical but not necessarily satisfactory solution was to force the order to be always the same.
For example, if we were interested in trends in the monitoring activities, we could, as a first
approximation, assume order 1. The design of our vector AR modeler indeed permitted such a
choice. A more formal approach defined a metric dist on $O_E$ as follows.

A point $x \in O_E$ was viewed as a pair $(p, X)$, where p was the order of the vector AR model
corresponding to x, and X was the serialized representation of this AR model, as described
above.

Then, for any points $x = (p, X)$, $x' = (p', X')$, and $x'' = (p'', X'')$ in $O_E$, define

$$\mathrm{dist}(x, x') = \begin{cases} \infty & \text{if } p \neq p' \\ \| X - X' \| & \text{if } p = p' \end{cases}$$

where the vector norm could be any norm, since it is well known that on a finite-dimensional
vector space all norms are topologically equivalent. In an actual software implementation, we
would of course select a computationally inexpensive norm. It was a straightforward matter to
check that dist was a valid metric, in the mathematical sense, namely that it was symmetric,
satisfied the triangle inequality, and for any $x \in O_E$, x was the only point whose distance to x
was zero.
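
The serialization and the metric dist can be sketched as follows. This Python fragment is illustrative only: the function names are hypothetical, the maximum norm is chosen here simply because it is cheap to compute, and both models are assumed to share the same state space dimension.

```python
import math


def serialize(intercept, noise_cov, coeff_matrices):
    """Row-by-row flattening of W, C, and A_1..A_p into a single vector."""
    flat = list(intercept)
    for matrix in [noise_cov] + list(coeff_matrices):
        for row in matrix:
            flat.extend(row)
    return flat


def dist(x, x_prime):
    """Distance between two points, each given as an (order, serialized vector) pair."""
    (p, v), (p_prime, v_prime) = x, x_prime
    if p != p_prime:
        return math.inf  # models of different order are never close
    return max(abs(a - b) for a, b in zip(v, v_prime))  # maximum norm


# Two hypothetical order-1 models in a two-dimensional state space.
identity = [[1.0, 0.0], [0.0, 1.0]]
a = (1, serialize([0.1, 0.2], identity, [[[0.5, 0.0], [0.0, 0.5]]]))
b = (1, serialize([0.1, 0.2], identity, [[[0.5, 0.1], [0.0, 0.2]]]))
c = (2, serialize([0.1, 0.2], identity, [[[0.5, 0.0], [0.0, 0.5]]] * 2))
d_ab, d_ac = dist(a, b), dist(a, c)
```

A monitoring component could compare the current model against stored reference points with this function and raise an alarm when the distance to a known pre-fault or fault point falls below a chosen tolerance.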

Practical value of the topological space $O_E$ would emerge only if we could attach network
management semantics to the points of the space $O_E$. First, the lifecycle of entity E became a
trajectory within the space $O_E$. In Exhibit 17, we show this lifecycle trajectory as a continuous
line. Second, a point may represent a normal behavior of entity E, a pre-fault behavior
(depicted as an oval in Exhibit 17), a fault behavior (depicted as a diamond), or a post-fault
behavior (depicted as a rectangle). Then, for example, if during real-time monitoring the state
being observed was close to that of a pre-fault or fault behavior, according to the above metric,
appropriate alarms might be raised. The research problem here was the characterization of a
vector AR model in terms of faults or precursors to faults. An experimental task we conducted
was the modeling of the same network at the same time of the day, for several days.


Exhibit 17. Operating Space as a Topological Space


We noted that the space $O_E$ was a compressed view (more accurately, a flattened view) of the life
of a managed entity. Indeed, the vector AR model did not tell the entire story. As a trivial
example, consider the case of a one-dimensional model x = 0.08 + 0.12 t for the number of
packets dropped by entity E. For small values of time t, the value of x probably signified
nothing. However, for large values of t, trouble-indicating thresholds would be exceeded. The
only thing we could say was that the model was a precursor to trouble, as it was an increasing
function of t for a quantity that could not grow indefinitely. Whether it actually signified a fault,
or closeness to a fault, had to be ascertained from the value of x along the time dimension.

In general, the extra time dimensions, corresponding to evolution in time of the same model,
were given by $X_t$ in equation (28), or more precisely, the columns of $X_t$. Our software prototype
was envisioned to watch both the space $O_E$ for change of model, and the time dimensions
for the evolution of the same model (Exhibit 18). The black line segments emanating from the
$O_E$ space represent levels of a state space variable (i.e., a column of $X_t$).

Report No. 99902-000-A004

Of course, when the state space dimension was larger than 1, we could not really provide a
three-dimensional depiction of the lifecycle of the managed entity E. Diagnosis through the
vector AR models might also be enhanced with rules and stochastic causal analysis. Work in
this area should be continued.


Exhibit  18.  Topological  Operating  Space  with  Time  Dimension 


2.3.6  Study  of  Mobile  Agents 


The two well-known standards for agent-based computing, the OMG Mobile Agent System
Interoperability Facility (MASIF) and the Foundation for Intelligent Physical Agents (FIPA)
specifications, were considered for their relevance to network monitoring in general, and to our
software prototype in particular. We looked at two FIPA implementations: Jade from CSELT of
Torino, Italy, and FIPA-OS from Nortel Networks of Ontario, Canada. We also considered more
general frameworks that subsumed both MASIF and FIPA. In this regard, we experimented with
the Grasshopper system from IKV++ GmbH, Germany, which permits both MASIF and FIPA plug-ins.

After  some  preliminary  experimental  work,  we  decided  not  to  pursue  the  FIPA-OS  route  but  to 
retain  the  Jade  system.  The  main  reason  was  that  we  were  not  able  to  exercise  agents  living  on 
multiple  platforms  using  FIPA-OS.  The  Jade  system  appeared  suitable  for  two  main  concerns  of 
network  monitoring  and  management:  (1)  the  large  amount  of  network  traffic  data,  and  (2)  the 
heterogeneity  of  the  network  operating  environment. 

Regarding the second issue, Jade permitted devices and machines running different operating
systems to host interoperable agents, as long as they could run a Java virtual machine, as Jade is
written entirely in Java. Jade, like any FIPA-compliant agent system, allows a flexible logical
partitioning of cooperating agents within a network management domain.

In Exhibit 19, we show this partitioning, the cubes representing entities to manage and to be
managed. The dashed parallelograms are what FIPA and Jade call agent platforms, the thin
solid arrows represent Java Remote Method Invocation (RMI) communication between agents
belonging to the same platform, and the thick arrows signify CORBA IIOP communication
between agents contained in separate platforms. This division scheme then provides different
levels of granularity, in terms of functionality and management requirements.

Exhibit 19. Jade Platforms for Network Management

[Exhibit 19 labels: Platform 1, Platform 2]

A FIPA agent, i.e., a Jade agent, always executed within an agent container, which in Jade is a
Java virtual machine. An agent platform was made of one or more containers, which could span
multiple devices and hosts, as shown in Exhibit 19. We could roughly map the concept of
platform to that of a management domain. A benefit of this mapping followed from the use of
Java RMI as the object communication framework, in lieu of the more expensive
language-independent CORBA IIOP. An enterprise network would consist of one or more agent
platforms.

We experimented with the Jade FIPA and the Grasshopper agent systems. Agent systems allow
various nodes on the network to share in the monitoring and management of the network, as
opposed to a central management station querying each node and making all decisions. We
concentrated on the FIPA and MASIF plug-ins in Grasshopper. Although the Grasshopper agent
system was very attractive for general agent computing, we felt that it was too demanding in
resources for our purposes. We did not, therefore, pursue it further, and returned to the Jade
FIPA system. These systems, however, were agent-based and therefore had to be installed on
network devices. Unfortunately, even though these types of agents would be extremely useful,
the reality was that most network managers were reluctant to install software on important
network devices.


2.3.7  Network Management Load Distribution


Management  load  across  a  network  that  is  supposed  to  be  managed  and  controlled  in  real  time  is 
both spatial and temporal. Local observables include all that pertain to a given node, i.e., traffic
in  and  out  of  the  node,  the  server’s  rate  at  that  node,  the  effective  buffer  volume  available  at  that 
node,  the  AR  and  truncated  trend  line  features  of  state  variables  at  that  node,  the  model-change 
time  point  computations,  and  model  characteristics  at  that  node.  Also,  at  each  node,  appropriate 
agents  at  that  node  would  carry  out  relevant  regional  computations.  If  a  radius  of  k-hops  was 
construed  to  be  the  maximum  spread  of  a  manageable  region,  then  for  all  region-definitions  from 
1  to  k  hops  (both  inclusive)  we  would  have  to  compute  regional  data  at  that  node. 
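
The regional bookkeeping described above, one region definition per radius from 1 to k hops, can be sketched with a breadth-first search. This is only an illustrative sketch: the adjacency-dictionary representation, the node names, and the function names are assumptions and do not come from the prototype.

```python
from collections import deque


def hop_distances(adjacency, source, max_hops):
    """Breadth-first search returning {node: hops} for nodes within max_hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the region radius
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def regions_up_to(adjacency, source, k):
    """Return {r: set of nodes within r hops of source} for r = 1..k."""
    dist = hop_distances(adjacency, source, k)
    return {r: {n for n, d in dist.items() if d <= r} for r in range(1, k + 1)}


# Hypothetical topology: a short chain with one branch at node B.
topology = {
    "A": ["B"],
    "B": ["A", "C", "E"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": ["B"],
}
regions = regions_up_to(topology, "A", 2)
```

A single search to depth k yields every smaller region as a by-product, so the per-node cost grows with the size of the k-hop neighborhood rather than with k separate traversals.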

For  computational  ease,  the  entire  manageable  region  within  h-hops  from  a  super-manager  node 
(or  the  controller  node)  could  be  arranged  in  a  hierarchical  fashion.  If  a  d-depth  tree  was  sought, 
at  each  level  except  for  the  leaf  nodes,  a  regional  controller  would  be  defined  with  roughly 
subregions  under  its  control.  Management  load  would  be  distributed  over  regional  and 
subregional  controllers. 

For our purpose, we configured the system with a finite number of non-overlapping regions, each
of which was roughly within some k-hop radius of its regional controller. All these
controllers periodically sent their computed regional and local data to the network
super-controller (the observer). It was incumbent upon the latter to report the functionality of
the network to its clients, particularly when the network’s traffic rate changed or when congestion
(local and/or distant) developed or subsided to normal levels.

For a node, not only did its network neighborhood need to be ascertained, but its temporal
dimension had to be addressed as well. How long should a node retain its data (raw and
computed)? How long should the regional data be stored at a node? Our preliminary finding
attempted to conceptualize it in the following way. In a dynamic and adaptive system, the
measurement of system states becomes crucial, since the window of opportunity to
amend a measurement may often be narrow. For a system that remains invariant in time, one has
an infinitely long window to improve upon measurements before they are reported. In our system,
we noted that some local state information at a time t might be required for the following
reasons:

□ To project a measure for the same state at a later time $t + \Delta$

□ To project a measure for a regional state (region being the network neighborhood
within k hops) at a future time $t + \Theta$, e.g., regional congestion level measures.

Therefore, a local system state at a node had to be available for at least $\tau = \max\{\Delta, \Theta\}$ units
of time if it were required for further computation. For convenience, we defined the following.

Residency or currency of an instance of an observable (a datum) is the amount of time it must
be stored for future reference. The currency of a datum e obtained at a time t was its storage
duration.

All our local variables could be partitioned into the following sets based on their individual
systemic contributions, which in turn determined their residency times. These sets were as
follows.

2.3.7.1  T (Transitional)

The variables in this set were derived and used (reported). A raw variable was not kept in the
store after one obtained its statistics. For instance, if $\theta$ was such a variable, then its expected
value and variance, $\mu_\theta = E(\theta)$ and $\sigma_\theta^2 = \mathrm{var}(\theta)$, might be retained for a longer time in lieu of
the variable $\theta$. Let us assume a policy that retains at least the last $n_\theta$ such pairs of values.
These values could be aggregated (if aggregation is possible) to a higher level of abstraction,
yielding a model of type

$$\Omega_\theta : \mu_\theta \to c \quad \text{or} \quad \Sigma_\theta : \sigma_\theta^2 \to d$$

Obviously, in such cases we would retain the functions $\Omega_\theta$ and $\Sigma_\theta$ with longer currency.

2.3.7.2  QT (Quasi-transitional)

If $\theta \in QT$, then its residency was determined in the following way. Let $f : \theta \to g$ be a function
which aptly described some quantity $g$ as a function of $\theta$. If the current $\theta$ was congruent to the
function $f$, we did not store $\theta$ but, instead, stored the specification of the function $f$ on the
understanding that an inverse mapping would be able to restore $\theta$ (with some loss of precision) if
we wanted to regain it. In this case, the functional form $f$ would be stored for some time $\Delta_f$.

If $\theta$ was incongruent with what the currently accepted functional form $f$ suggested, the latest
value for $\theta$ would be stored. It was possible that such a sampled value might appear as a noise
spike or it might be a genuine change in the value of $\theta$. However, this could only be ascertained
if a sequence of successive sampled values of $\theta$ showed a departure from the functional form $f$
currently in use to some other form $f'$. Therefore, for data in the QT set, we suggested the
following.

Given  and the currently applicable functional form $f : \theta(t) \to g$ for all < t < $t_i$,
excluding the sampled value $\theta_i$ on the ground that $f(\theta_i) \ne g_i$, collate all successive
sampled values into another model given by $f' : \theta \to g$ until for some $\theta = \theta_n$. Thus, the currency of
the variable $\theta$ was then the interval for which the old functional form was not acceptable and the
new one was as yet unregistered, i.e.,

$$C(\theta) = t_n - t_i, \quad \text{where } t_i = \min(t) \text{ given that } f(\theta(t)) \ne g(t) \text{ and } t_n = \min(t) \text{ given that } f'(\theta(t)) = g(t)$$

The QT grouping would also include those variables entering CUSUM analysis. Suppose
that the variable $\theta$ was monitored at a node using its CUSUM statistics. At the beginning of a
new model epoch we would monitor $\theta$ until the next model change point. Therefore, within that
epoch the only information of interest would be its expected value and its variance, but until
these were obtained we would keep the raw $\theta$ values sampled and collected thus far. These
might be just the last m raw $\theta$ values until a new model emerged.

An entity might require p sets of states to provide a measure. For instance, a link was defined
between two nodes and, therefore, any performance measure of a link would require two state
values. The currency of a link state was then at least as large as the minimum of the two state
currencies. In this way, we could define the currency of a p-state entity (one needed at least p
states from p distinct nodes comprising the entity) as

$$C(\theta \mid p\text{-states dependency}) = \min_{j} C_j(\theta)$$

The frequency of information exchange between regional nodes and the super-node would be
governed by this measure. The currency measure could not be too large lest it lose its relevance
given the real-time character of the observable domain.
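
The two retention rules in this section can be written down in a few lines. The sketch below assumes that a local state must be kept for the longer of its two look-ahead horizons (same-state projection and regional projection), and that a p-state entity takes the minimum of its node-level currencies, as in the formula above. Function and parameter names are hypothetical.

```python
def local_residency(same_state_horizon, regional_horizon):
    """Time a local state must be retained: the longer of the two horizons."""
    return max(same_state_horizon, regional_horizon)


def entity_currency(node_currencies):
    """Currency of a p-state entity: the minimum of its p node-level currencies."""
    if not node_currencies:
        raise ValueError("an entity needs at least one state currency")
    return min(node_currencies)


# Hypothetical link between two nodes whose states stay current for 30 s and 45 s.
link_currency = entity_currency([30.0, 45.0])
```

Since the link measure is only as fresh as its stalest contributing state, the minimum also suggests an upper limit on how long a regional node may wait between reports to the super-node.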


2.3.8  Scalability  Issues  of  Traffic  Models 


We continued looking at scalability issues of traffic models, especially our own vector AR
models. We were trying to determine how we could quantify network traffic as it related to a
subnet or a group of nodes on the network instead of a single network device. For example, what
if we wanted to know the utilization of a group of routers, but not switches? We currently have
the ability to model an individual node based on multiple variables. In other words, the AR
model can be based on, say, the traffic in, the traffic out, and the number of errors. One way of
scaling this is to treat the columns of the coefficient matrices, which correspond to multiple
variables of the same node, as corresponding instead to a single variable at multiple nodes. For
example, instead of having traffic in, traffic out, and traffic errors for a single node, we could
use traffic in from three separate nodes. This appears to provide a useful measure of network
activity but has not yet been incorporated into the RTNM.


2.3.9  Alternative  Frameworks 

As stated previously, we experimented with various SNMP frameworks such as AdventNet,
JMAPI, and JDMK/JMX. We decided to use the JDMK. The JDMK is a commercially
supported management framework, which provides the most reliable SNMP management we
have found so far.

JDMK also provides a framework for using agents and management beans, or “MBeans,” to
dynamically alter the actions an agent performs, or the way the agent performs them. JDMK also
allows agent-to-agent communication. These abilities of JDMK may open the door to a much
more distributed and extensible management system.

We looked at several open-source network management projects, including GHOST and
OpenNMS. None of these used a mathematical modeling approach for extracting information
from the SNMP time series data that they collected; therefore, none were used.


2.4  TASK 4 - SOFTWARE SYSTEM IMPLEMENTATION


2.4.1  Software Evaluations


The software development process differed between Phase I and Phase II of this SBIR largely in
that no software design tools or integrated development environment (IDE) had been used in
Phase I. We therefore began the Phase II development by evaluating Java IDEs and software
design tools. We selected IBM VisualAge for Java based on reviews of its features, including the
debugger. This Phase II project was much larger than the Phase I proof of concept; we knew we
needed the features that IBM VisualAge for Java contained. Table 9 shows the comparisons
among the software design tools.

Table 9. Comparison of Software Design Tools


2.4.2  SNMP Data Collection Implementation

An efficient data collection class for collecting SNMP data from multiple nodes was imperative
to this project. Without collected data, utilization could not be projected. This class allowed us
to query nodes precisely and gather information about them. We
focused our efforts on routers and switches. These nodes offered us the most traffic as well as
the most congested areas. We could expand our study to other nodes fairly easily once we
identified models as well as procedures for selecting models.

The information collected included MIB II variables from the physical layer up to the
network layer. We focused on the IF (interface) layer and the IP (network) layer. The data
collected from the IF layer is listed below.

□ ifInOctets (ifEntry 10) Description: Total number of octets in the interface for
inbound traffic, including the framing characters.

□ ifInUcastPkts (ifEntry 11) Description: Total number of subnetwork unicast packets
delivered to a higher level protocol.

□ ifInNUcastPkts (ifEntry 12) Description: Total number of broadcast/multicast
subnetwork packets delivered to a higher level protocol.

□ ifInDiscards (ifEntry 13) Description: Total number of inbound packets discarded
due to lack of buffer space.

□ ifInErrors (ifEntry 14) Description: Total number of inbound packets discarded due
to errors.

□ ifInUnknownProtos (ifEntry 15) Description: Total number of inbound packets
discarded due to unknown protocol specification.

□ ifOutOctets (ifEntry 16) Description: Total number of octets transmitted to the
network from this interface (including framing characters).

□ ifOutUcastPkts (ifEntry 17) Description: Same as for inbound traffic but now with
changed direction.

□ ifOutNUcastPkts (ifEntry 18) Description: Same as for inbound traffic but now with
changed direction.

□ ifOutDiscards (ifEntry 19) Description: Same as for inbound traffic but now with
changed direction.

□ ifOutErrors (ifEntry 20) Description: Same as for inbound traffic but now with
changed direction.

□ ifOutQLen (ifEntry 21) Description: The instantaneous length of the output packet
queue in packets.
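
To make the list above concrete, the following sketch maps the listed column names to their MIB-II object identifiers and derives a utilization figure from two successive octet-counter samples. It is an illustration under stated assumptions, not the prototype's collector: the helper names are hypothetical, the octet counters are assumed to be 32-bit and to wrap at most once between samples, and the interface speed (ifSpeed, which is not among the variables listed above) is assumed to be known.

```python
IF_ENTRY_OID = "1.3.6.1.2.1.2.2.1"  # MIB-II interfaces.ifTable.ifEntry

# Column numbers as listed in the text (ifEntry 10 through 21).
IF_COLUMNS = {
    "ifInOctets": 10, "ifInUcastPkts": 11, "ifInNUcastPkts": 12,
    "ifInDiscards": 13, "ifInErrors": 14, "ifInUnknownProtos": 15,
    "ifOutOctets": 16, "ifOutUcastPkts": 17, "ifOutNUcastPkts": 18,
    "ifOutDiscards": 19, "ifOutErrors": 20, "ifOutQLen": 21,
}


def column_oid(name, if_index):
    """Build the OID of one ifEntry column for a given interface index."""
    return f"{IF_ENTRY_OID}.{IF_COLUMNS[name]}.{if_index}"


def counter_delta(previous, current, modulus=2**32):
    """Difference between two counter samples, allowing for a single wrap."""
    return (current - previous) % modulus


def utilization(octets_prev, octets_now, seconds, if_speed_bps):
    """Fraction of link capacity used in one direction over the interval."""
    if seconds <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and interface speed must be positive")
    bits = counter_delta(octets_prev, octets_now) * 8
    return bits / (seconds * if_speed_bps)
```

For example, 1,250,000 octets received in 10 seconds on a 10 Mb/s interface corresponds to a utilization of 0.1. Note that ifOutQLen is an instantaneous gauge rather than a counter, so it should be read directly and not differenced.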

We also collected information on the TCP and UDP stacks. All this information could be stored
in a database via JDBC.

Using the JDBC API for MySQL, we linked our collection routine to a database. We initially used
Sybase because it was available at BAE SYSTEMS and had been used in the Phase I prototype.
We extended support to MySQL because it was free and supported JDBC. The free database
allowed us to have multiple test networks without extra cost.

In keeping with Java’s “write once, run anywhere” theme, we ran our application on multiple
platforms: Sun’s JDK on Windows NT 4.0, Solaris 7, and Windows 95.

The network discovery algorithm was optimized. Aside from these changes, there were
corrections to the algorithm. While the network discovery did seem to work effectively, we
discovered that some of the SNMP queries sent to the Asynchronous Transfer Mode (ATM)
switches were not being returned. Moreover, we discovered that nothing detected when
queries went unanswered, so missing responses passed unnoticed. With the changes we made, we
solved this problem and achieved a far more complete network discovery. On one network,
for example, we were able to detect almost 1,800 machines, yet only 124 of them were
SNMP-capable. This was without using Ping or Traceroute to query an entire block of IP
addresses as most discoveries do.


2.4.3  CUSUM and AR Design

The CUSUM filter Java implementation went smoothly, based on our earlier Matlab
implementation. We paid particular attention to the interface of these filters to other components
of our software. Specifically, we expected the filters to be called in real time, and thus to be
passed a reasonable slice of traffic measurements from the data collectors.

Concerning  the  data  structures,  the  main  issue  in  implementing  the  object  model  was  the 
determination  of  the  optimal  data  run  length  of  traffic  time  series  that  achieves  an  acceptable 
real-time  calculation  of  model  change  points,  without  unduly  taxing  traffic  data  collection  and 
archive.  From  our  experimental  work,  a  length  of  about  ten  data  points  appeared  acceptable. 
For  analysis  purposes,  it  was  also  decided  to  record  the  collection  time  for  traffic  data.  Time  was 
stored  as  String  objects.  The  only  constructor  of  the  TimeSeriesData  class  of  the  object  model 
was  then  expected  to  pass  an  array  of  not  less  than  ten  strings,  and  an  array  of  not  less  than  ten 
double-precision  floating  point  numbers.  Another  object  class  that  encapsulated  both  data  and 
corresponding  times  could  have  been  designed.  However,  after  analysis,  we  reasoned  that  the 
design  elegance  did  not  pay  for  the  time  needed  to  pack  and  unpack  objects  of  this  encapsulating 
class. 
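
The constructor contract described above can be sketched as follows. This is a Python rendering for illustration, not the Java class itself: the field names are hypothetical, and the check that both arrays have equal length is an added assumption that the text does not state.

```python
from dataclasses import dataclass
from typing import List

MIN_RUN_LENGTH = 10  # "about ten data points" appeared acceptable in the text


@dataclass(frozen=True)
class TimeSeriesData:
    """Parallel arrays of collection times (strings) and traffic values."""
    times: List[str]
    values: List[float]

    def __post_init__(self):
        # Enforce the minimum run length on both arrays.
        if len(self.times) < MIN_RUN_LENGTH or len(self.values) < MIN_RUN_LENGTH:
            raise ValueError(f"need at least {MIN_RUN_LENGTH} times and values")
        # Assumed, not stated in the text: one time label per value.
        if len(self.times) != len(self.values):
            raise ValueError("times and values must have the same length")
```

Keeping the two arrays side by side, rather than wrapping each sample and its time in its own object, mirrors the trade-off the text describes: less elegant, but with no per-sample packing and unpacking.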

Regarding the CusumChangeData class, two Boolean flags were used to signal a client whether
the object it received from the CUSUM filter contained a new mean and/or variance. This
approach was necessary as the filter was normally called with the same number of data points,
which might or might not produce a model change. If changes occurred, their time labels were
returned as well. The CusumFilter class accumulated data and corresponding times that it
received from clients. We used a length of a thousand and never exceeded it. The only
constructor of CusumFilter took in initial data, which were used to compute initial values for
mean, variance, and acceptable shifts for mean and variance. The actual algorithms used by the
filter() method of the CusumFilter class are given in Section 2.4.4.

Exhibit 20 shows a somewhat abused UML collaboration diagram for the vector AR modeler.
Normally, such a collaboration diagram is meant to capture the interaction between objects of a
UML model. Here, we used it to show the interaction between the static methods of
MultivariateAutoregressiveModel, which placed the various classes used in the AR modeler in a
better perspective.

Exhibit 20. UML Collaboration Diagram for AR Modeler


Note  that,  with  the  numbering  scheme  of  Exhibit  20,  the  correct  sequence  of  calls  from  the  initial 
client  call  until  its  return  was  1-3-5-6-7-8-4-2.  The  OrderBoundedTimeSeries  class  was 
designed  to  allow  the  passing  of  the  time  series  data  as  a  Jampack  Zmat  object,  and  the  range  of 
putative  model  orders,  as  explained  in  Section  2.3.1. 

The RegularizedUpperTriangularQR class held the upper triangular matrix, i.e., the R part,
resulting from the QR decomposition of a matrix, called regularizedKmatrix, obtained from the
time series data Kmatrix regularized by a regularization vector regVec. The latter was
essentially the square root of the machine epsilon times the vector computed from summing each
column of the entry-wise absolute value of Kmatrix.
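
The regularization vector just described can be computed directly. The sketch below uses plain Python lists rather than Jampack Zmat objects, and the function name is hypothetical; it shows only the scaling rule stated in the text, not how the vector was subsequently applied to Kmatrix.

```python
import math
import sys


def regularization_vector(k_matrix):
    """Return sqrt(machine epsilon) times the column sums of |K|.

    `k_matrix` is a list of equal-length rows of floats.
    """
    if not k_matrix:
        return []
    n_cols = len(k_matrix[0])
    # Sum the entry-wise absolute values down each column.
    col_sums = [sum(abs(row[j]) for row in k_matrix) for j in range(n_cols)]
    scale = math.sqrt(sys.float_info.epsilon)
    return [scale * s for s in col_sums]
```

Scaling by the column sums keeps the regularization proportional to the magnitude of each column, so that columns with large traffic counts and columns with small ones are perturbed by comparable relative amounts.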

The  DimensionedUpperTriangularQR  class  carried  an  upper  triangular  matrix  (supposedly  from 
QR  decomposition),  along  with  the  range  of  putative  model  orders.  In  addition,  it  carried  two 
integers  for  the  state  space  dimension  (i.e.,  the  number  of  columns  in  the  traffic  time  series)  and 
a  number  of  block  equations  (essentially,  the  number  of  time  steps  in  the  series  minus  the 
maximum  order). 

The  SchwarzCriterionVector  class  held  three  Zmat  objects,  the  first,  called  schwarzValue,  being 
the  only  one  of  relevance  to  the  current  implementation.  It  contained  a  list  of  values  for  a  given 
order  selection  criterion,  for  all  the  orders  in  the  putative  list.  Although  the  variable  name 
referred  to  the  well-known  Schwarz  criterion,  other  criteria  were  also  used  in  our 
implementation, including the Akaike criterion and a modified Schwarz criterion. The other
components of the SchwarzCriterionVector class might find use in later implementations, e.g., in
spectral analysis.

The  ModelParameters  class  defined  what  makes  up  a  vector  model.  It  contained  an  integer  for 
the  model  order  (the  optimal  one  in  a  least-squares  sense),  a  Zmat  object  for  the  intercept  vector, 
a  Java  Vector  of  Zmat  objects  for  the  model  coefficient  matrices,  and  another  Zmat  object  for  the 
noise  covariance  matrix.  For  possible  subsequent  extensions,  we  also  decided  to  include  a 
SchwarzCriterionVector  object,  as  defined  above,  and  another  Zmat  object  which  might  be 
useful  in  computing  confidence  intervals  and  spectral  information. 

Note  that  the  MultivariateAutoregressiveModel  class  did  not  contain  any  public  data.  The  actual 
algorithms  used  by  the  vector  AR  modeler  are  given  in  Section  2.4.4. 

For  the  topological  operating  space  Oe,  two  Java  classes  were  designed  that  related  to  the 
ModelParameters  class  according  to  Exhibit  21.  The  TopologicalSpacePoint  class  was  the 
serialized  version  of  the  earlier  ModelParameters  class.  In  theory,  we  could  have  added 
additional  attributes  and  methods  to  ModelParameters  to  handle  the  serialization  chores; 
however,  we  gained  speed  efficiency  in  incorporating  the  new  class,  at  a  small  cost  of  space. 
TopologicalSpacePoint  stored  both  the  state  space  dimension  and  the  model  order,  as  well  as  the 
actual array of doubles resulting from the serialization scheme explained in Section 2.3.5. The
methods shown in Exhibit 21 are self-explanatory.

Exhibit 21. Class Diagram for Topological Space Oe and Model Evolver


The ModelEvolver class took care of advancing a model in time, that is, predicting values of the
vector Xt, in the notation of equation (28) of Section 2.3.5, at a single time point t or at multiple
time points t[]. Such predictive values are useful in monitoring the life of a managed entity, and
as a means for detecting pre-fault behavior.
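
A minimal sketch of such a model evolver follows. It assumes that equation (28), which lies outside this excerpt, has the usual vector AR form in which the state at time t equals an intercept vector plus the sum of coefficient matrices applied to the p previous states, with the noise term dropped for point prediction. The function names and the list-of-lists matrix representation are hypothetical.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def predict_next(intercept, coefficients, history):
    """One-step prediction; history[-1] is the most recent state and
    coefficients[0] is the lag-1 coefficient matrix."""
    if len(history) < len(coefficients):
        raise ValueError("history is shorter than the model order")
    prediction = list(intercept)
    for lag, a_matrix in enumerate(coefficients, start=1):
        contribution = mat_vec(a_matrix, history[-lag])
        prediction = [x + c for x, c in zip(prediction, contribution)]
    return prediction


def evolve(intercept, coefficients, history, steps):
    """Iterate the model forward, feeding each prediction back as history."""
    states = [list(x) for x in history]
    predicted = []
    for _ in range(steps):
        nxt = predict_next(intercept, coefficients, states)
        states.append(nxt)
        predicted.append(nxt)
    return predicted
```

For a one-dimensional model of order 1 with intercept 1.0 and coefficient 0.5, starting from a state of 4.0, two steps give 3.0 and then 2.5, drifting toward the fixed point 2.0. Comparing such predicted trajectories against thresholds is one way the predictions could flag pre-fault behavior.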

The CUSUM filter and vector AR modeler fit within the agent framework discussed in Section
2.3.6 and shown in Exhibit 22. If Jade-based research and development is pursued, then our
agent class would inherit basic agent behaviors from the Jade-supplied Agent class. In addition
to this functionality, we would also like to incorporate a rule-based engine, such as Jess (the Java
Expert System Shell, from Sandia Labs), the Java incarnation of the CLIPS system. We also
learned that the Jade agent system interfaced with Jess, which might simplify some design
decisions if Jade were to be incorporated into our software. In our investigation, a rule
corresponded to a pattern of events and statistic values that constituted a network problem. At
that point, we were trying to examine the outputs of our CUSUM filters.

Exhibit 22. Class Diagram for Monitoring Agents


2.4.4  The  Implementation  of  the  CUSUM  Filter  and  AR  Model 

The algorithm used by the CUSUM filter proceeded as follows. First, the traffic data and
corresponding times in the TimeSeriesData object passed to the filter() method were appended to
what the CusumFilter object had already accumulated (recall that a filter object always returned
after each change point, even if its data had not been completely processed). Next, we
remembered how deep we were in an epoch, defined as a time interval without change. Then,
we looped through the time series, each iteration performing the following tasks:

□ Recomputing two-sided CUSUMs for the mean

□ Recomputing two-sided CUSUMs for the variance

□ Deciding whether there was a change point for the mean

□ Deciding whether there was a change point for the variance.

Finally, we constructed the CusumChangeData object to be returned, making sure to update the
data held by the CusumFilter object and the residual length.

Let $\mu$ and $\delta$ be the current mean value and the permitted half shift from the current mean without
triggering a signal. Let $\lambda(\mu,U)$ and $\lambda(\mu,L)$ be the last mean upper CUSUM and last mean lower
CUSUM computed. If the current data value was d[i], then two temporary values $\tau_1$ and $\tau_2$ were
obtained as follows:

$$\tau_1 = \lambda(\mu,U) + d[i] - \mu - \delta$$

$$\tau_2 = \lambda(\mu,L) - d[i] + \mu - \delta$$

If $\tau_1 > 0$, then $\lambda(\mu,U)$ was changed to $\tau_1$, and a running count of how many times the
upper mean CUSUM had been non-zero was updated; a similar analysis was carried out for
$\lambda(\mu,L)$ and associated quantities. The two-sided CUSUMs for the variance were obtained in a
manner similar to that used for the mean, with the understanding that temporary CUSUM values
for the variance were computed before any update to the mean was actually carried out.

The decision for the occurrence of a change point for the mean and the switch to a new mean
value was based on a comparison between $\lambda(\mu,U)$ and the accepted bound for the mean. If such
a decision was made, then all monitors were reset as well and appropriate new values for the mean
half shift and bound were computed. The change point decision for the variance was made in a
similar fashion. In addition, a number of ancillary quantities, such as the residual length and the
data accumulated by the filter, and their time labels, had to be updated. Testing of the CUSUM
filter was done using both synthetic and real traffic data.
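
The mean-monitoring part of the procedure can be sketched as below. This is a simplified illustration, not the prototype's filter() method: it follows the two update formulas given above, assumes that a non-positive temporary value resets the corresponding CUSUM to zero (the conventional choice, which the text does not spell out), assumes that the lower CUSUM is compared against the same bound as the upper one, and omits the variance CUSUMs, the running counts, and the recomputation of mean, half shift, and bound after a change. All names are hypothetical.

```python
def cusum_mean_step(upper, lower, value, mean, half_shift):
    """Update the upper and lower mean CUSUMs with one data value."""
    tau1 = upper + value - mean - half_shift
    tau2 = lower - value + mean - half_shift
    # Assumed: a non-positive temporary value resets that CUSUM to zero.
    return max(0.0, tau1), max(0.0, tau2)


def detect_mean_change(data, mean, half_shift, bound):
    """Return the index of the first sample at which either CUSUM exceeds
    the bound, or None if no change point is signalled."""
    upper = lower = 0.0
    for i, value in enumerate(data):
        upper, lower = cusum_mean_step(upper, lower, value, mean, half_shift)
        if upper > bound or lower > bound:
            return i
    return None
```

With a current mean of 10, a half shift of 1, and a bound of 3, the series 10, 10, 10, 13, 13, 13 signals at index 4: the upper CUSUM reaches 2 on the first elevated sample and 4 on the second. The half shift absorbs small fluctuations, so only a sustained departure accumulates enough to cross the bound.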

The algorithms used by the vector AR modeler proceeded as follows. The top-level method,
model() in Exhibit 20, simply made an OrderBoundedTimeSeries object from the raw traffic time
series slice and the putative order range, then called the computeARmodel() method, which
produced the corresponding ModelParameters object. The computeARmodel() method itself started
off by computing a RegularizedUpperTriangularQR object from the OrderBoundedTimeSeries object
passed to it by calling the computeQRofAR() method, evaluating a QR factorization at the highest
order specified. An order selection criterion vector, returned in a SchwarzCriterionVector
object, was then computed via the computeOptimalOrder() method. A search was performed to
find the minimal entry of this selection criterion vector. An optimal order p and an optimal
parameter size m were next computed with the minimum order specified in the call being taken
into account this time.
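
The search step can be illustrated as follows, under the assumption that entry i of the criterion vector corresponds to order min + i, where min is the smallest putative order. The function name is hypothetical, and the criterion values themselves (Schwarz, Akaike, or the modified Schwarz criterion) are taken as given.

```python
def select_order(criterion_values, min_order):
    """Return (order, value) for the smallest entry of the criterion vector.

    Assumes criterion_values[i] belongs to model order min_order + i.
    Ties are resolved in favor of the lowest order.
    """
    if not criterion_values:
        raise ValueError("criterion vector is empty")
    best_index = min(range(len(criterion_values)), key=criterion_values.__getitem__)
    return min_order + best_index, criterion_values[best_index]
```

For a putative range starting at order 1 with criterion values 3.2, 1.1, and 2.7, the selected order is 2. Offsetting by the minimum order is the step in which the lower end of the putative range is "taken into account."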

The next phase of the algorithm computed the intercept vector W and the coefficient matrices A_i.
First,  the  R  matrix  from  the  QR  decomposition  was  divided  into  relevant  subblocks.  Second,  an 
augmented  coefficient  matrix  was  obtained  from  a  regularization  vector  that  was  a  byproduct  of 
the  QR  decomposition  and  from  solving  a  linear  system  of  equations.  Next,  the  noise  covariance 
matrix  was  computed  from  scaling  an  appropriate  subblock  of  R.  Finally,  a  few  other  matrices, 
relevant  for  an  eventual  computation  of  spectral  information  and  confidence  intervals,  were 
computed. 

The computeQRofAR() method was almost a classical QR decomposition, except that it was not
performed on the original matrix resulting from the traffic time series. Instead, predictors
and regulators were injected into the matrix. The Zqrd suite in Jampack was used to perform the
actual QR decomposition.
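
One common way to inject predictors is to lay out, for each time step, a row holding a constant term, the lagged observations, and the current observation. The sketch below shows only that layout, under the assumption that the modeler used a layout of this kind; the regularization rows are omitted, and all names are hypothetical.

```java
// Hypothetical sketch: builds rows [1, v(t-1), ..., v(t-p), v(t)] from a vector time series.
final class DataMatrixSketch {

    /** series[t] is the m-dimensional observation at time t; p is the model order. */
    static double[][] build(double[][] series, int p) {
        int n = series.length;
        int m = series[0].length;
        double[][] k = new double[n - p][1 + m * p + m];
        for (int t = p; t < n; t++) {
            double[] row = k[t - p];
            int c = 0;
            row[c++] = 1.0;                      // intercept predictor
            for (int lag = 1; lag <= p; lag++) { // lagged observations
                for (int d = 0; d < m; d++) {
                    row[c++] = series[t - lag][d];
                }
            }
            for (int d = 0; d < m; d++) {        // current observation
                row[c++] = series[t][d];
            }
        }
        return k;
    }
}
```

For the scalar series 1, 2, 3, 4 with p = 2, this yields the rows [1, 2, 1, 3] and [1, 3, 2, 4]; a QR factorization of such a matrix is what the later steps would consume.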

The computeOptimalOrder() method received the upper triangular matrix R of the QR
decomposition. It then set up a collection of vectors, including the order selection criterion
vector and the noise covariance determinant logarithm vector, of maximum length max - min + 1,
where max and min denoted the putative largest and smallest model orders. The noise covariance
matrix was computed as the inverse of a subblock of R. The essential part of
computeOptimalOrder() was, however, a backward iteration through the range of possible orders,
in which the noise covariance inverse was updated through Cholesky decompositions and left
divisions. We found that Jampack lacked the precision we required for Cholesky decompositions,
so we switched to the JAMA package for these decompositions and then returned to Jampack to
update the determinant logarithm vector and the order selection vector. Testing of the vector
AR modeler was done using both synthetic and real traffic data.
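
To make the role of the Cholesky step concrete, the following sketch computes a Cholesky factor and the determinant logarithm of a symmetric positive definite matrix. It uses plain arrays rather than JAMA or Jampack so that it stands alone; the class name is hypothetical, and this is not the update scheme used in computeOptimalOrder().

```java
// Hypothetical sketch: textbook Cholesky factorization and log-determinant on plain arrays.
final class CholeskySketch {

    /** Returns lower triangular L with A = L * L^T; A must be symmetric positive definite. */
    static double[][] cholesky(double[][] a) {
        int n = a.length;
        double[][] l = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j <= i; j++) {
                double sum = a[i][j];
                for (int k = 0; k < j; k++) {
                    sum -= l[i][k] * l[j][k];
                }
                if (i == j) {
                    if (sum <= 0.0) {
                        throw new IllegalArgumentException("matrix is not positive definite");
                    }
                    l[i][i] = Math.sqrt(sum);
                } else {
                    l[i][j] = sum / l[j][j];
                }
            }
        }
        return l;
    }

    /** log det(A) = 2 * sum of log L[i][i], which avoids overflow in the determinant itself. */
    static double logDet(double[][] a) {
        double[][] l = cholesky(a);
        double sum = 0.0;
        for (int i = 0; i < l.length; i++) {
            sum += Math.log(l[i][i]);
        }
        return 2.0 * sum;
    }
}
```

Working with the logarithm of the determinant, rather than the determinant, keeps the entries of the determinant logarithm vector well scaled across model orders.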

Algorithms for the TopologicalSpacePoint class were, at this point, confined to serialization
and deserialization of vector AR models, and to the computation of the topological distance
between any two objects of the class. The natural order imposed by equation (28) in Section
2.3.5 and the structure of the ModelParameters class formed the basis of the serialization.
The topological distance was computed via the Frobenius norm of a Zmat object in Jampack,
although it was rather straightforward to compute other equivalent norms directly.
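
A minimal sketch of such a distance follows, assuming that two models of equal shape are compared entry by entry. Real-valued arrays are used here in place of the complex Zmat type, and the class name is hypothetical.

```java
// Hypothetical sketch: Frobenius-norm distance between two equally shaped parameter matrices.
final class TopologicalDistanceSketch {

    /** Square root of the sum of squared entry-wise differences. */
    static double frobeniusDistance(double[][] a, double[][] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("row counts differ");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            if (a[i].length != b[i].length) {
                throw new IllegalArgumentException("column counts differ in row " + i);
            }
            for (int j = 0; j < a[i].length; j++) {
                double d = a[i][j] - b[i][j];
                sum += d * d;
            }
        }
        return Math.sqrt(sum);
    }
}
```

Because the norm depends only on the entries, the same value is obtained from the serialized (flattened) form of the two models, provided both are serialized in the same order.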

Algorithms for monitoring agents were, at that time in the project, at a very early stage of
development. An agent was a Java serializable thread object whose capabilities were stored as
hashmaps of languages and ontologies. It might be interesting to design a new language for
communication between network manager agents and managed device agents; however, the languages
provided by the FIPA specifications were probably sufficient, if extended to cover the needs
of SNMP. A more worthwhile task lay in connecting both manager and managed agents to the
rule-based engine provided by Jess. In particular, a translation between FIPA message
structures and Jess facts should be made, so that rule bases written as CLIPS files might be
executed by an agent.
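
One possible shape for such a translation is sketched below: the fields of a message are rendered as the text of a Jess assert command over a slotted fact. The template name acl-message, the slot names, and the class are all hypothetical, and a real translation would need a matching deftemplate in the rule base.

```java
// Hypothetical sketch: renders message fields as the text of a Jess/CLIPS assert command.
final class AclFactSketch {

    /** Builds "(assert (template (slot "value") ...))", preserving the map's iteration order. */
    static String toJessAssert(String template, java.util.Map<String, String> slots) {
        StringBuilder sb = new StringBuilder("(assert (").append(template);
        for (java.util.Map.Entry<String, String> e : slots.entrySet()) {
            // Escape backslashes and double quotes so the value stays one string literal.
            String value = e.getValue().replace("\\", "\\\\").replace("\"", "\\\"");
            sb.append(" (").append(e.getKey()).append(" \"").append(value).append("\")");
        }
        return sb.append("))").toString();
    }
}
```

For a message with performative "inform" and sender "manager", this produces (assert (acl-message (performative "inform") (sender "manager"))), which a rule engine could then match against.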


2.4.5  Event  Correlation  Engine 

Work on the event correlation engine and diagnosis subsystems was begun. Specifically, we
looked at SMARTS, System Management ARTS Incorporated’s system for object-oriented diagnostic
modeling and codebook correlation. Instead of relying solely on a rule-based system, SMARTS
uses what it calls a codebook. The codebook, which is based on an object-oriented diagnostic
model (OODM), is an extremely efficient mechanism that addresses some of the limitations of
rule-based systems. Given our time constraints, however, we did not pursue this approach and
decided on the following.

The RTNM’s diagnostic capabilities would be derived from a limited set of variables. These
variables, when combined with a Bayesian network and a rule-based system, would provide a
proactive management system capable not only of detecting network problems before they became
failures, but also of providing information as to the cause of these problems. The variables
used in the diagnosis subsystem consisted of the messages (events) generated by the various
RTNM components, namely Java events, SNMP events, and traffic events.

Java  events  were  the  error  or  status  messages  generated  from  within  the  Java  language.  Java 
errors  might  occur  when  there  were  network  problems,  database  problems,  or  even  internal 
communications  problems  among  the  various  software  modules. 

SNMP  events  were  messages  relating  to  the  monitoring  of  the  network  traffic  and  devices. 
These  events  might  be  in  response  to  active  SNMP  queries  or  as  a  result  of  network  devices 
experiencing  a  problem.  The  events  generated  by  a  device  were  known  as  traps. 

Traffic  events  occurred  when  there  were  changes  in  traffic  patterns,  high  traffic  loads,  or 
anomalous  traffic  activity.  These  traffic  events  would  be  “fired”  when  one  or  more  of  the 
modeling  subsystems  detected  a  significant  problem  or  change  in  the  network  traffic. 
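
The three event categories can be pictured with a small data structure. The sketch below is illustrative only: the class, the enum, and the utilization threshold check are assumptions, not the RTNM event classes.

```java
// Hypothetical sketch: one event type tagged by source, plus a simple traffic-event trigger.
final class RtnmEventSketch {

    enum Source { JAVA, SNMP, TRAFFIC }

    final Source source;
    final String detail;

    RtnmEventSketch(Source source, String detail) {
        this.source = source;
        this.detail = detail;
    }

    /** "Fires" a TRAFFIC event when utilization exceeds the threshold; otherwise returns empty. */
    static java.util.Optional<RtnmEventSketch> checkUtilization(
            String node, double utilization, double threshold) {
        if (utilization > threshold) {
            String detail = node + ": utilization " + utilization + " exceeds " + threshold;
            return java.util.Optional.of(new RtnmEventSketch(Source.TRAFFIC, detail));
        }
        return java.util.Optional.empty();
    }
}
```

Tagging every event with its source lets a downstream diagnosis step treat Java, SNMP, and traffic events uniformly while still distinguishing where each came from.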

The  Java  implementation  of  the  diagnostic  subsystems  would  use  the  following  tools. 

□  Rule-based  system  and  fuzzy  logic  rules:  Jess  and  FuzzyJ  add-on. 

□  Bayesian  belief  networks:  JavaBayes  package. 
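
To indicate what a belief network contributes, the sketch below applies Bayes' rule to a two-node network (a fault and an event it may cause). It does not use the JavaBayes API; the class name and the probabilities in the usage note are purely illustrative assumptions.

```java
// Hypothetical sketch: posterior probability of a fault given one observed event.
final class BayesSketch {

    /**
     * P(fault | event) = P(event | fault) * P(fault)
     *                    / (P(event | fault) * P(fault) + P(event | no fault) * (1 - P(fault))).
     */
    static double posterior(double priorFault, double pEventGivenFault, double pEventGivenNoFault) {
        double joint = pEventGivenFault * priorFault;
        double evidence = joint + pEventGivenNoFault * (1.0 - priorFault);
        return joint / evidence;
    }
}
```

With an illustrative prior of 0.01, an event that appears in 90% of faults, and a 5% false-alarm rate, the posterior is 2/13, or roughly 0.15. A single event therefore raises suspicion without establishing a cause, which is one reason to correlate several events.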

The  overall  event  diagnosis  /  correlation  architecture  for  the  RTNM  is  shown  in  Exhibit  23. 

Exhibit 23. Event Correlation Architecture for RTNM


2.4.6  Java  Optimization 


Now that the entire architecture of the application was complete, we were able to go back and
optimize some of our algorithms. Because Java is notoriously slow and memory hungry, it was
imperative to optimize wherever we could. The initial results were impressive. Many of the
RTNM initialization and Graphical User Interface (GUI) response times were halved. The network
discovery tool, while still not perfect, appeared to be almost four times faster and required
far fewer system resources.

The final look and layout of the RTNM interface were established. The various RTNM components
were incorporated into one central interface (see Exhibit 24). The RTNM now had an uncluttered
top-level view of the management tools. These tools included network discovery, utilization,
prediction, and monitoring for both the Internet Protocol (IP) and Interface layers, as well
as access to Ping, Traceroute, trap browser, and database configuration tools. Each of these
had its own drill-down capabilities, as shown in Exhibits 25, 26, and 27.

Exhibit 26. RTNM Node IP View
[Screenshot; visible labels: Current Utilization, Predicted Service Rate, Actual Service Rate,
Traffic Model Change Point]

Exhibit 27. RTNM Tools
[Screenshot of the Real-Time Network Manager Portal database tool; the visible table lists
discovered nodes with columns including IpAddress, hostName, and isSNMP]

Report  No.  99902-000-A004 


2.4.7  Traffic  Generation  Tools 

In  order  to  provide  an  adequate  demonstration  of  the  capabilities  of  the  RTNM,  it  was  necessary 
to  research  traffic  generation  applications.  These  would  allow  us  to  create  traffic  on  a  network 
and  provide  better  test  and  demonstration  scenarios.  Outlined  below  are  five  free  traffic 
generators  that  we  tested  with  our  application. 

2.4.7.1 NetPerf

NetPerf was designed by an employee at Hewlett-Packard (HP). HP offers the user no protection
should something go wrong; nevertheless, it appears to be an effective tool. NetPerf is a
benchmark that can be used to measure the performance of many different types of networking.
It provides tests for both unidirectional throughput and end-to-end latency. The environments
currently measurable by NetPerf include:

□  TCP  and  UDP  via  BSD  Sockets 

□  DLPI 

□  Fore  ATM  API 

□  HP  HiPPI  Link  Level  Access 

See  URL:  http://www.netperf.org/ 

2.4.7.2 PSC TReno Server

TReno is being developed at the Pittsburgh Supercomputing Center (PSC) at Carnegie Mellon
University. The project is supported by several government grants. TReno tests the network
under a load similar to that of TCP and uses the same technique as Traceroute. It is
single-threaded, so the user must wait until there is no other test traffic on the network
before testing. TReno can be run from the PSC server or locally after downloading the software.

See URL: http://www.psc.edu/networking/TReno_info.html

2.4.7.3 TTCP and NTTCP

ttcp was originally written to move files around, but it became the classic throughput
benchmark and load generator. With the addition of support for sourcing from /dev/null, it has
spawned many variants, which include support for UDP, data pattern generation, page alignment,
and even alignment offset control. One variant, nttcp, allows multicast UDP transfers.

See  URLs:  ttcp  is  located  at  ftp://ftp.arl.mil/pub/ttcp/ 

nttcp  is  located  at  http://www.leo.org/~elmar/nttcp/ 

2.4.7.4 NetPipe

NetPipe, written in Java, is a protocol-independent performance tool. It has many of the same
qualities as ttcp and NetPerf.

See URL: http://www.scl.ameslab.gov/netpipe/

2.4.7.5 Chariot

This program also generates traffic to test a LAN. It supports multiple protocols, but it does
not seem to be as robust as some of the others we found. Chariot tests the performance of
networked applications, stress tests network devices, predicts application performance prior
to deployment, and troubleshoots network performance problems. Chariot’s performance data can
be used to optimize a network and to measure the performance impact of proposed network
changes.

See URL: http://www.netiq.com/Products/Network_Performance/Chariot/Default.asp


2.4.8  Software  Test  Sites 


In  order  for  us  to  test  our  applications  we  used  four  sites  with  different  topologies  and  networks 
for  testing  and  collecting  data.  The  first  network  on  which  we  performed  collection  was  at  BAE 
SYSTEMS  Rome  Operations.  This  is  a  small,  25-node  network  connected  to  the  Internet.  It  is 
not  heavily  used  and  is  bursty  in  nature. 

The second was the network at the Air Force Research Laboratory - Rome Research Site. This
network is a large, 1600+ node network with many subnets and multiple ELANs and VLANs.
Utilization is medium to heavy, and the network contains many different types of SNMP nodes,
including Fore ATM switches, Fore powerhubs, hosts, printers, and CD-ROM servers. Some of the
powerhubs contain over 100 ports, which is a network management challenge in itself.

The third site was a university in the northeastern United States. This university gave us
permission to monitor their network, provided that we did not publicize their identity or our
findings. This is also a large network with many on-line systems. The SNMP nodes range from
Bay Networks Centillion ATM switches to HP bridges and a Cisco router. This network yielded
some interesting findings, especially since students used the network around the clock.

The  fourth  site  was  at  the  Air  Force  Research  Laboratory  -  Rome  Research  Site.  It  was  in  the 
Joint  Integration  and  Test  Facility  on  a  classified  network.  We  wanted  to  see  how  RTNM  would 
react  to  a  very  different  and  secure  network. 

Each test site had unique characteristics that stressed RTNM in different ways. Some helped us
fix problems in RTNM; others suggested ways to improve it. All in all, having different test
environments proved very valuable.

3.0 APPENDIX

3.1  RTNM  INSTALLATION  INSTRUCTIONS 


3.1.1  Introduction 

This  document  briefly  describes  what  is  needed  for  installation  and  use  of  RTNM  on  your 
network. 


3.1.2  Software  Required 


The software required for RTNM includes:

□  JDK  1.3 

□  MySQL  version  3.2X 

□  RTNM  software  which  includes  APIs  such  as  JDMK  4.2,  JAMA,  JAMPACK,  and 
MM  MySQL  JDBC  version  0.9 

A  version  of  UCD-SNMP  for  the  PC  is  also  included  for  testing  purposes. 


3.1.3 Software Installation and Configuration


Software  installation  includes  MySQL,  JDK,  and  RTNM  software. 

3.1.3.1 MySQL Installation and Configuration

MySQL is the first application that needs to be installed. Run the Setup.exe file found in the
MySQL directory on the CD and follow the MySQL installation instructions. Once MySQL is
running, you need to create a database, create a user, and grant that user access for RTNM.

mysql -u root -p mysql
Enter password:
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 7 to server version: 3.22.27

Type 'help' for help.

mysql> create database snmp;
Query OK, 1 row affected (0.00 sec)

mysql> insert into user (Host,User,Password) values('localhost','rtnm',PASSWORD('rtnmH'));
Query OK, 1 row affected (0.00 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.01 sec)

mysql> grant all on snmp.* to rtnm;
Query OK, 0 rows affected (0.01 sec)
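
Once the database and user exist, a Java client could reach them through JDBC roughly as sketched below. The class and method names are hypothetical; the sketch assumes the MM MySQL driver (class org.gjt.mm.mysql.Driver) is on the classpath and that the server runs on the named host.

```java
// Hypothetical sketch: connecting to the snmp database through the MM MySQL JDBC driver.
final class RtnmJdbcSketch {

    /** Builds a JDBC URL of the form jdbc:mysql://host/database. */
    static String jdbcUrl(String host, String database) {
        return "jdbc:mysql://" + host + "/" + database;
    }

    /** Opens a connection; fails with ClassNotFoundException if the driver jar is missing. */
    static java.sql.Connection open(String host, String database, String user, String password)
            throws ClassNotFoundException, java.sql.SQLException {
        Class.forName("org.gjt.mm.mysql.Driver"); // registers the driver with DriverManager
        return java.sql.DriverManager.getConnection(jdbcUrl(host, database), user, password);
    }
}
```

For the session above, the call would be open("localhost", "snmp", "rtnm", password), where password is whatever value was set for the rtnm user.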

3.1.3.2 JDK 1.3 Installation and Configuration

Double-click the j2sdk-1_3_1-win icon and follow the instructions.

3.1.3.3 RTNM Installation and Configuration

RTNM is installed with the jar command, which also verifies that your path is set up
correctly. First, choose a directory in which RTNM will be installed, e.g., C:\RTNM. Next, run
the command jar -xvf <drive letter of cdrom>:rtnm.jar from the installation directory. This
installs RTNM into the directory, after which it is ready to run.

There is a run batch file for Windows systems and a run script for UNIX systems. The RTNM
launch portal can be started with “run” on Windows systems or “./run” on UNIX systems.

Note: The run script can be made executable on UNIX systems by issuing the chmod command from
the RTNM directory: “chmod 755 run” will allow you to just type “run”.

Next  select  discover  to  discover  your  network.  Once  the  network  is  discovered,  RTNM  can 
proactively  manage  your  network. 


4.0 GLOSSARY

API       Application Programmer's Interface
AR        Autoregressive
ARMA      Autoregressive Moving Average
ATM       Asynchronous Transfer Mode
CDRL      Contract Data Requirements List
CORBA     Common Object Request Broker Architecture
CUSUM     Cumulative Sum
DARPA     Defense Advanced Research Projects Agency
DC        Direct Current
ELAN      Emulated Local Area Network
EMS       Equivalent Markovian Server
FFT       Fast Fourier Transform
FIPA      Foundation for Intelligent Physical Agents
FIPA-OS   FIPA Operating System
GUI       Graphical User Interface
IID       Independent Identically Distributed
IP        Internet Protocol
JAMA      Java Matrix Package
JDBC      Java Data Base Connectivity
JDMK      Java Dynamic Management Kit
Jess      Java Expert System Shell
JMAPI     Java Management Application Programmer's Interface
JMX       Java Management Extensions
MASBF     Mobile Agent Software
MESRA     Markov Equivalent Server's Rate Abstraction
MIB       Management Information Base
NIST      National Institute of Standards and Technology
OID       Object Identification Number
OMG       Object Management Group
OODM      Object-oriented Diagnostic Model
RMI       Remote Method Invocation
RMON      Remote Monitoring
RTNM      Real-Time Network Manager
SBIR      Small Business Innovative Research
SMARTS    System Management ARTS
SNMP      Simple Network Management Protocol
SUNY IT   State University of New York Institute of Technology
TCP       Transmission Control Protocol
UDP       User Datagram Protocol
UML       Unified Modeling Language
VLAN      Virtual Local Area Network